
refactor(cardano): shard, reorder, and merge the EWRAP boundary pipeline#978

Merged
scarmuega merged 32 commits into main from feat/shard-ewrap-work-units on Apr 29, 2026

Conversation

@scarmuega (Member) commented Apr 25, 2026

Summary

Shards the per-account legs of the epoch-boundary pipeline (Ewrap, Estart, Rupd) so they no longer materialise the full account / pending-rewards namespace in memory at once, and adds first-class crash-recovery semantics so a mid-boundary restart resumes at the correct shard instead of replaying already-committed work.

Final architecture

Five work units (Genesis, Roll, Rupd, Ewrap, Estart) plus the ForcedStop sentinel. Three of them are sharded:

| Work unit | Per-shard leg | `finalize()` (global) |
| --- | --- | --- |
| Rupd | reward computation against the mark snapshot; emits `RupdProgress` | writes `EpochState.incentives`, clears `rupd_progress` |
| Ewrap | applies pending rewards, routes unspendable rewards (treasury vs reserves), accumulates totals into `EpochState.end` via `EWrapProgress` | enactment / MIR / pool-proposal refunds / wrap-up; emits `EpochWrapUpV2` to close the boundary |
| Estart | per-account snapshot rotation via `AccountTransition`; emits `EStartProgress` | rotates remaining (pool / drep / proposal) snapshots, computes initial pots, emits `EpochTransitionV2` |

Boundary order: … Roll → Rupd → Roll … → Ewrap shards → Ewrap finalize → Estart shards → Estart finalize → next epoch's Roll ….

The lifecycle now looks like initialize → start_shard..total_shards { load → compute → commit_wal → commit_state → commit_archive → commit_indexes } → finalize (crates/core/src/sync.rs::run_lifecycle). Each sharded unit caches its total_shards and start_shard in initialize from persisted *_progress fields on EpochState, so a config change between shards or a crash mid-pipeline doesn't break the in-flight boundary.
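The shape of that loop can be sketched as follows. All names here are illustrative stand-ins for the real trait in crates/core/src/work_unit.rs, which differs in detail (domain parameters, the split commit phases, error types):

```rust
// Hypothetical sketch of the resumable lifecycle; not the real trait.
trait WorkUnit {
    fn initialize(&mut self) {}
    fn total_shards(&self) -> u32 { 1 }
    /// First shard NOT yet committed — the persisted resume cursor.
    fn start_shard(&self) -> u32 { 0 }
    fn run_shard(&mut self, shard_index: u32);
    fn finalize(&mut self) {}
}

/// initialize → start_shard..total_shards { per-shard work } → finalize.
/// Returns the shard indices actually executed, for illustration.
fn run_lifecycle<W: WorkUnit>(unit: &mut W) -> Vec<u32> {
    unit.initialize();
    let mut executed = Vec::new();
    for shard_index in unit.start_shard()..unit.total_shards() {
        unit.run_shard(shard_index); // load → compute → commit_* in the real runner
        executed.push(shard_index);
    }
    unit.finalize();
    executed
}

/// A unit restarted after a crash that had already committed shards 0..=2.
struct Resumed;
impl WorkUnit for Resumed {
    fn total_shards(&self) -> u32 { 8 }
    fn start_shard(&self) -> u32 { 3 }
    fn run_shard(&mut self, _shard_index: u32) {}
}
```

Because `start_shard()` reads the persisted cursor, a restarted `Resumed` runs only shards 3..=7 rather than replaying the non-idempotent ones it already committed.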

The crate::shard module (crates/cardano/src/shard.rs) owns the credential-key-prefix partitioning, validated at compile time on the ACCOUNT_SHARDS constant and at runtime against any total_shards value pulled from persisted ShardProgress.total.
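A minimal sketch of the partitioning idea, assuming the invariant is that the shard count evenly divides the 256 possible first key bytes (illustrative names and types; the real shard_key_ranges works over full EntityKey ranges):

```rust
use std::ops::Range;

/// Split the 0x00..=0xFF first-byte prefix space into `total_shards`
/// contiguous buckets. Hypothetical sketch, not the real implementation.
fn shard_prefix_ranges(total_shards: u16) -> Vec<Range<u16>> {
    // Unconditional asserts: a corrupt persisted total must never divide
    // by zero or silently mis-partition, even in release builds.
    assert!(total_shards > 0, "total_shards must be positive");
    assert!(256 % total_shards == 0, "total_shards must divide 256 evenly");
    let width = 256 / total_shards;
    (0..total_shards)
        .map(|i| (i * width)..((i + 1) * width))
        .collect()
}
```

With 16 shards this yields buckets `0..16`, `16..32`, …, `240..256`, so every credential key lands in exactly one shard's range scan.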

Major commit clusters

  • Sharding scaffold (948d472eeaf8e08c, fd888b8e, c7fce119) — split EWRAP and ESTART into per-shard + finalize halves, then add the same shape to RUPD. WorkUnit trait grows total_shards(), initialize(), finalize().
  • Boundary nomenclature (9627c3ae, ae260792, 2294bb32, 2ff83b52, 9a80080d, 1a3722db, 8174d2a8, f84792ba, aa835bb6, 4728e21f, 1b849366) — renames + module split + persisted shard count.
  • Unification (07c3191c, 62031ea8, a36ccf67) — collapses the standalone AccountShard/AShard module into EWRAP's per-shard leg; renames EpochEndAccumulateEWrapProgress; replaces the runtime account_shards config field with the compile-time-validated ACCOUNT_SHARDS constant.
  • Correctness fixes from PR review (bb3a195a, 5122e075, 317b6766, plus the four most recent fixes below).

Recent correctness + compatibility work

  • WAL backward compatibility (39237536) — bincode encodes CardanoDelta variants by index and structs by position. Adding EWrapProgress / EStartProgress / RupdProgress mid-enum and appending prev_*_progress fields onto EpochWrapUp / EpochTransition would have broken every pre-PR WAL row. Variants are now appended to the end (indices 0..=38 frozen), and the original structs are kept verbatim under #[deprecated] while EpochWrapUpV2 / EpochTransitionV2 carry the new undo state. Pre-upgrade WAL still decodes; new commit paths emit V2.
  • Restart cursor (3e50db60) — initialize() previously read only *_progress.total, and the lifecycle loop iterated 0..total_shards unconditionally. After a crash mid-boundary, restart replayed shards 0..k-1 — unsafe because AccountTransition is non-idempotent. Adds WorkUnit::start_shard() (default 0); the three sharded units cache *_progress.committed in initialize and the lifecycle loop now runs start_shard..total_shards. Also propagates load_epoch failures via ? instead of silently falling back to defaults.
  • Archive ending_state refresh (28fe4c39) — commit_finalize was archiving the pre-EpochWrapUp snapshot because stream_and_apply_namespace::<EpochState> only mutated the writer, never the in-memory BoundaryWork.ending_state. A new EpochState-specific helper returns the post-apply singleton; commit_finalize swaps it into ending_state before the archive write.
  • Release-mode shard invariants (56b6dbb9) — shard_key_ranges previously used debug_assert! for total_shards > 0 and total_shards.is_multiple_of(256), so release builds could divide by zero or silently mis-partition on a corrupt persisted ShardProgress.total. Promoted to unconditional assert!/panic!.
  • Housekeeping (bd082462, 0719956d, 3492e5fb) — typo in genesis seeding comment, markdownlint MD040 on a fenced code block, stale BoundaryWork::load_ewrap reference, memory test now samples allocation across iteration (not just construction), clippy clean, and a sweep of the skills/debug-epoch-mismatch guide that still referenced the removed ASHARD work unit and ashard/ module paths.
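The WAL-compatibility point above can be illustrated without bincode itself: an index-based codec writes a variant's position, not its name, so inserting a variant mid-enum renumbers everything after it, while appending leaves every pre-existing index stable (illustrative types only):

```rust
// Hypothetical enums standing in for CardanoDelta before and after the PR.
#[derive(Debug, PartialEq, Clone, Copy)]
enum DeltaV1 { EpochWrapUp, EpochTransition }                // indices 0, 1

#[derive(Debug, PartialEq, Clone, Copy)]
enum DeltaV2 { EpochWrapUp, EpochTransition, EWrapProgress } // appended: 0, 1 frozen

/// An index-based encoder writes only the variant's position.
fn encode_v1(d: DeltaV1) -> u32 { d as u32 }

/// The upgraded decoder still maps old indices to the same variants.
fn decode_v2(idx: u32) -> Option<DeltaV2> {
    match idx {
        0 => Some(DeltaV2::EpochWrapUp),
        1 => Some(DeltaV2::EpochTransition),
        2 => Some(DeltaV2::EWrapProgress),
        _ => None,
    }
}
```

Had `EWrapProgress` been inserted at index 1 instead, every pre-upgrade row encoding `EpochTransition` would decode as the wrong variant — which is why the struct layout is also frozen and new undo state rides on the appended V2 types.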

Test plan

  • cargo check --workspace clean
  • cargo clippy --workspace --all-targets clean (only the unrelated chumsky future-incompat warning remains, from a transitive dep)
  • cargo test -p dolos-cardano --lib — 110 unit tests pass, including new restart_after_{estart,ewrap,rupd}_replays_*, restart_safety_no_skipped_work, epoch_boundary_emits_ewrap_then_estart, plus parallel epoch_wrap_up_v2_roundtrip / epoch_transition_v2_roundtrip proptests covering the V1/V2 split
  • cargo test --workspace --exclude dolos — full workspace test sweep green
  • cargo test --test bootstrap — crash-after-state-commit recovery (the helper now correctly skips finalize() to model the actual crash boundary)
  • cargo test --test memory — shard-range iteration stays heap-bounded across both construction and full iteration
  • cargo test --test epoch_pots --release — gold-standard end-to-end against DBSync ground truth (requires populated Mithril test instance)
  • Manual bootstrap from Mithril to confirm the genesis seeding + V2 commit paths produce the same EpochState archive as a pre-PR run

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Sharded epoch-boundary processing for per-account work plus a single global finalize pass.
  • Improvements

    • Robust mid-boundary crash recovery via persisted per-shard progress and resumable lifecycle.
    • More accurate, range-restricted reward routing and consolidated finalization for epoch snapshots.
  • Tests

    • New memory-boundedness and epoch-harness tests validating shard-aware behavior.
  • Documentation

    • Added sharded lifecycle specification and updated debugging guidance.

scarmuega and others added 6 commits April 24, 2026 17:40
Partitions the epoch-boundary EWRAP work unit — previously one monolithic
work unit that materialised O(active_accounts) in memory (rewards map,
deltas, logs, applied_rewards) — into three phase-specific work units that
each commit independently:

* EwrapPrepare: global classification (pools/dreps/proposals), MIRs,
  enactment + refund visitors for non-account entities, emits EpochEndInit
  seeding EpochState.end with the prepare-time globals and zeroed reward
  accumulators.
* EwrapShard(i): range-scoped (first-byte prefix bucket) load of pending
  rewards + accounts, runs rewards + drops visitors per account, emits
  EpochEndAccumulate with the shard's reward contribution.
* EwrapFinalize: reads the accumulated EpochState.end, emits EpochWrapUp
  (which transitions rolling/pparams snapshots and clears ewrap_progress).

Cross-shard handoff piggy-backs on EpochState rather than a new entity:
ewrap_progress: Option<u32> is the durable cursor and EpochState.end
accumulates across shards via the new deltas.

WorkBuffer gains EwrapShardingBoundary{shard_index, total_shards} and
EwrapFinaliseBoundary states; pop_work now takes ewrap_total_shards from
CardanoConfig (default 16). EpochEndAccumulate has an idempotency guard
keyed on ewrap_progress so shard re-execution after a crash is safe.
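The cursor-keyed guard can be sketched like this (hypothetical field and function names; the real delta also carries the reward accumulators and treasury/reserve totals):

```rust
/// Hypothetical slice of EpochState relevant to the guard.
struct EpochEnd {
    rewards_acc: u64,       // cross-shard reward accumulator
    progress: Option<u32>,  // durable cursor: next shard expected
}

/// Apply one shard's accumulate delta. Replaying an already-committed
/// shard after a crash is a no-op, so totals are never double-counted.
fn apply_accumulate(state: &mut EpochEnd, shard_index: u32, shard_rewards: u64) {
    let expected = state.progress.unwrap_or(0); // None = fresh boundary, shard 0
    if shard_index < expected {
        return; // shard committed before the crash: skip
    }
    state.rewards_acc += shard_rewards;
    state.progress = Some(shard_index + 1);
}
```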

Detection-only crash recovery at initialize time logs a warning when
ewrap_progress is set; full block-rehydration resume is flagged as TODO.

Memory tests in tests/memory.rs verify both fjall and redb3 honour
range-scoped iter_entities with O(1) heap — the load-bearing property for
the shard design.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…n ESTART

Decouple two responsibilities that were tangled in EwrapPrepareWorkUnit:
the global epoch-boundary entity processing (now plain `Ewrap`) and the
structural opening of the `EpochState.end` slot (now done by ESTART's
`EpochTransition`). Ewrap's `EpochEndInit` delta keeps its overwrite
semantics; it now writes into a default-seeded slot rather than from
None.

Also adds `prev_end` / `prev_ewrap_progress` undo fields to
`EpochTransition` (serialized, like the other prev_* fields) so a
rollback after restart correctly restores them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The per-account leg of the epoch close was named after its position in the
EWRAP pipeline; AccountShard names what it actually does — apply rewards
and pool/drep delegator drops over a key-range slice of the account
namespace.

Also renames the related symbols (BoundaryWork::load_shard /
commit_shard → load_account_shard / commit_account_shard,
WorkBuffer::EwrapShardingBoundary → AccountShardingBoundary,
InternalWorkUnit::EwrapShard → AccountShard). The user-facing
`ewrap_total_shards` config field is intentionally preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…line

The epoch-boundary sequence is now AccountShard ×N → Ewrap → EwrapFinalize
(was Ewrap → AccountShard ×N → EwrapFinalize). Per-account work settles
first; the global Ewrap phase then patches the prepare-time fields onto an
EpochState.end that already has its reward accumulators populated.

State machine: WorkBuffer::pop_work transitions reordered, and
on_ewrap_boundary now takes ewrap_total_shards so the restart-at-boundary
entry can construct AccountShardingBoundary directly. The
total_shards == 0 defensive branch now skips to EwrapBoundary (global
phase) instead of EwrapFinaliseBoundary.

Delta semantics:
- EpochEndInit::apply is now a PATCH — writes only the prepare-time
  fields (pool counts, epoch_incentives, MIR amounts, proposal refunds)
  and leaves the accumulator fields alone. ewrap_progress is no longer
  touched by this delta. Dropped the unused prev_ewrap_progress field.
- EpochEndAccumulate::apply treats ewrap_progress = None as the natural
  starting state for shard 0 (unwrap_or(0) as the expected cursor).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…delta

The boundary close is now a single Ewrap work unit: it runs the global
visitors AND emits EpochWrapUp carrying the assembled final EndStats
(prepare-time fields combined with the AccountShard-populated accumulator
fields). The wrap-up visitor now constructs the final stats locally
instead of routing them through a separate EpochEndInit delta.

Side-benefits: one fewer state-machine state, one fewer delta type, one
fewer commit cycle. Atomicity also improves — the boundary close is now
a single state-writer commit, so a crash between Ewrap and EwrapFinalize
is no longer possible.

Test fixture in tests/epoch_pots/main.rs restructured to match the
post-reorder pipeline: accumulator reset gates on AccountShard
shard_index == 0; rewards CSV is dumped on the Ewrap arm.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After the boundary-pipeline reorder (AccountShard runs before Ewrap),
the first epoch's AccountShard hits `EpochEndAccumulate::apply` with
`entity.end == None` because Genesis bootstrapped the EpochState before
ESTART's `EpochTransition` had a chance to seed the slot. Seed
`end = Some(EndStats::default())` directly in Genesis to match the
invariant ESTART maintains for every subsequent epoch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
coderabbitai Bot commented Apr 25, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Splits epoch-boundary work into per-shard and global finalize phases for Ewrap, Estart, and RUPD; adds shard partitioning, persisted shard progress cursors, sharded loading/commit APIs, and a centralized lifecycle runner (initialize → per-shard loop → finalize) to enable resumable, memory-bounded boundary processing.

Changes

Cohort / File(s) Summary
Sharding core
crates/cardano/src/shard.rs
New module that partitions CBOR-derived credential keys into lexicographic shard ranges; exposes shard constants, validation, and shard_key_ranges() with rollover logic and unit tests.
WorkUnit lifecycle / orchestration
crates/core/src/work_unit.rs, crates/core/src/sync.rs, crates/core/src/import.rs
WorkUnit trait extended with total_shards() and initialize(); per-phase methods now accept shard_index; finalize() added. run_lifecycle executes initialize → per-shard loop → finalize; execute/import delegate to it.
Ewrap (close-half) sharding
crates/cardano/src/ewrap/loading.rs, crates/cardano/src/ewrap/commit.rs, crates/cardano/src/ewrap/work_unit.rs, crates/cardano/src/ewrap/mod.rs, crates/cardano/src/ewrap/wrapup.rs, crates/cardano/src/ewrap/rewards.rs
Reworks Ewrap into load_shard/compute_shard_deltas/commit_shard for per-account ranges and load_finalize/commit_finalize for global MIR/refund/wrap-up; makes namespace streaming helper crate-visible with optional Range<EntityKey>; adds shard-local accumulators and adjusts wrap-up emission to V2.
Estart (open-half) sharding
crates/cardano/src/estart/loading.rs, crates/cardano/src/estart/commit.rs, crates/cardano/src/estart/work_unit.rs, crates/cardano/src/estart/reset.rs, crates/cardano/src/estart/mod.rs
Mirrors Ewrap split: per-shard account snapshot rotation via load_shard/commit_shard, global finalize via load_finalize/compute_global_deltas/commit_finalize; BoundaryVisitor no longer buffers deltas; emit_epoch_transition() added; streaming helper gains optional range.
RUPD sharded pipeline
crates/cardano/src/rupd/loading.rs, crates/cardano/src/rupd/mod.rs, crates/cardano/src/rupd/work_unit.rs
Stake snapshot split into load_globals (pool-level) and merge_shard (per-account ranges); per-shard commits write shard-scoped PendingRewardState and guarded RupdProgress; archive writing deferred to finalize() with pool_log_shares aggregation; shard_ranges and reward gating added.
Epoch model & deltas
crates/cardano/src/model/epochs.rs, crates/cardano/src/model/mod.rs, crates/cardano/src/model/pending.rs
Adds persisted shard progress cursors (ShardProgress) and three progress deltas (EWrapProgress, EStartProgress, RupdProgress), V2 epoch wrap/transition types with undo semantics, EndStats::default(), and doc tweaks for pending states.
Work unit implementations & wiring
crates/cardano/src/lib.rs, crates/cardano/src/work.rs, crates/cardano/src/ewrap/work_unit.rs, crates/cardano/src/estart/work_unit.rs, crates/cardano/src/genesis/work_unit.rs, crates/cardano/src/roll/work_unit.rs, crates/cardano/src/rewards/mod.rs
Adapts Cardano work units to shard-aware lifecycle: removal of config from some constructors, total_shards resolution in initialize, per-shard load/commit usage, RewardsContext::should_include added to gate reward emission.
BoundaryWork / WorkContext changes
crates/cardano/src/ewrap/mod.rs, crates/cardano/src/ewrap/commit.rs, crates/cardano/src/estart/commit.rs, crates/cardano/src/ewrap/loading.rs
BoundaryWork split into shard vs finalize variants; stream_and_apply_namespace made pub(crate) and accepts optional key-range; new commit_shard and commit_finalize (and corresponding load_shard/load_finalize) APIs introduced.
Archive/commit semantics
crates/cardano/src/ewrap/commit.rs, crates/cardano/src/estart/commit.rs, crates/cardano/src/rupd/work_unit.rs
Per-shard archive logs written under shard/epoch temporal keys and committed atomically with state; global archive writes and finalized epoch snapshots moved to finalize phase; commit methods adjusted to warn on uncommitted deltas.
Sharness/tests & memory
crates/testing/src/harness/cardano.rs, tests/epoch_pots/main.rs, tests/memory.rs, tests/bootstrap.rs
Harness callbacks extended with shard_index and receive final post-finalize sentinel; epoch test aggregates applied rewards across shards and dumps once; new memory-boundedness tests validate iterating a single shard-range is low-memory.
Docs & tooling
crates/cardano/work_units.md, skills/debug-epoch-mismatch/SKILL.md, crates/cardano/src/estart/mod.rs, crates/core/src/lib.rs
New work-units specification and debugging doc updates (ASHARD vs EWRAP distinctions); ChainError::InvalidConfig(String) added; module-level docs updated to reflect sharded/finalize workflow.

Sequence Diagram(s)

sequenceDiagram
    participant Sync as sync::run_lifecycle
    participant WU as WorkUnit (Ewrap/Estart/Rupd)
    participant State as StateStore (persist)
    participant Archive as ArchiveStore

    Sync->>WU: initialize(domain)
    Note over WU: determine total_shards, init globals
    loop for shard_index in 0..total_shards-1
        Sync->>WU: load(domain, shard_index)
        Note over WU: load_shard(ranges): pending rewards/snapshot for ranges
        Sync->>WU: compute(shard_index)
        Note over WU: compute_shard_deltas(): apply per-account visitors
        Sync->>WU: commit_wal(domain, shard_index) [opt]
        Sync->>WU: commit_state(domain, shard_index)
        Note over WU: commit_shard: write shard-scoped entities, apply progress delta
        WU->>State: write shard state (PendingRewardState, deltas)
        WU->>Archive: write shard archive logs
        Sync->>WU: commit_archive(domain, shard_index)
        Sync->>WU: commit_indexes(domain, shard_index)
    end
    Sync->>WU: finalize(domain)
    Note over WU: load_finalize + commit_finalize: process MIRs/refunds, write final archive, advance/clear cursors
    WU->>State: write final epoch snapshot & global entities
    WU->>Archive: write finalized archive logs
sequenceDiagram
    participant Shard as Per-shard processing
    participant Progress as EpochState Progress
    participant Global as Finalize pass

    Shard->>Shard: load_shard(ranges)
    Shard->>Shard: compute_shard_deltas() (apply rewards/drops)
    Shard->>Progress: apply(EWrapProgress / RupdProgress)
    Shard->>Shard: commit_shard (state + archive)
    Note over Shard,Progress: shard accumulators added into EpochState.end
    Global->>Global: load_finalize()
    Global->>Global: compute_global_deltas() (MIRs/refunds/wrapup)
    Global->>Progress: apply(EpochWrapUpV2 / EpochTransitionV2)
    Global->>Shard: clear or verify progress cursors
    Global->>Shard: commit_finalize (final archive + snapshot)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes


Poem

🐰
Shards nibble keys in tidy rows,
Per-slice the carrot of reward grows,
Finalize hops in, seals the plot,
Cursors saved — resume the spot,
A rabbit cheers: whole boundary, not!

🚥 Pre-merge checks: ✅ 5 passed

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and specifically summarizes the main refactoring: splitting EWRAP into sharded phases and consolidating the boundary pipeline. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped — CodeRabbit's high-level summary is enabled. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


scarmuega and others added 6 commits April 25, 2026 11:51
Pulls the AccountShard work unit out of `ewrap/` into a peer module
`ashard/` (matching the layout of `estart/`, `rupd/`, `roll/`,
`genesis/`). The shared `BoundaryWork` / `BoundaryVisitor` infrastructure
and the drops visitor (used by both phases) stay in `ewrap/`; `ashard/`
imports them.

Moves: `rewards.rs`, `shard.rs`, `AccountShardWorkUnit` (from
`work_unit.rs`), and the `load_*` / `commit_*` impl blocks. Visibility
on shared `BoundaryWork` helpers (`new_empty`, `load_pool_data`,
`load_drep_data`, `stream_and_apply_namespace`) widened from private to
`pub(crate)`. The `ending_state` field also widened to `pub(crate)` so
peer modules can mutate it (e.g. `wrapup.flush` already does this).

Method/identifier renames to match the new module path:
- `BoundaryWork::load_account_shard` → `load_ashard`
- `BoundaryWork::commit_account_shard` → `commit_ashard`
- `WorkUnit::name()` returns `"ashard"`

Type name `AccountShardWorkUnit` is preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…mments

Sweeps the docstrings/comments touched in this PR for references to
phases, work units, and deltas that no longer exist after the rename /
reorder / merge / split sequence:

- Restore the in-place explanation for the "rewards before drops"
  HACK in `ashard/loading.rs` (the dangling "see comment on the
  pre-shard path" pointed to a comment that was deleted when the
  prepare phase was removed).
- Drop "prepare phase" / "finalize phase" wording from `BoundaryWork`
  field docstrings, `commit_ewrap` comments, and `loading.rs` section
  dividers — neither phase exists; there's only Ewrap (global + close)
  and AccountShard (per-account).
- Update the ESTART `EpochTransition` description in `work_units.md`
  so it reflects the post-merge data flow: AccountShards populate the
  accumulators directly, then Ewrap reads them back and emits
  `EpochWrapUp` with the final `EndStats` (no `EpochEndInit` patch step
  anymore).
- Rename `compute_prepare_deltas` → `compute_ewrap_deltas`. The
  "prepare" name was a leftover from the `EwrapPrepare` work unit; the
  method is now the only Ewrap-phase compute helper.
- Tighten `load_pending_rewards_range` docstring; flag that the `None`
  branch is currently unused.

No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The shard-related identifiers and comments were named after the legacy
EWRAP pipeline that bundled the global epoch-boundary work and the
per-account shards together. With AccountShard now a distinct work unit
in its own module, those names are misleading. Rename to use the
`ashard` prefix consistently with the module path:

- `CardanoConfig::ewrap_total_shards` → `ashard_total`
- `CardanoConfig::DEFAULT_EWRAP_TOTAL_SHARDS` → `DEFAULT_ASHARD_TOTAL`
- `EpochState::ewrap_progress` → `ashard_progress`
- `prev_ewrap_progress` → `prev_ashard_progress` on `EpochEndAccumulate`,
  `EpochWrapUp`, and `EpochTransition`
- `WorkBuffer::receive_block` / `on_ewrap_boundary` / `pop_work`
  parameter `ewrap_total_shards` → `ashard_total`
- Error messages in `ashard/shard.rs` updated to match.

Also fixes comment / doc misattributions where "EWRAP" was used for
work that's now in `AccountShard`:
- `PendingRewardState` / `DequeueReward` are consumed by `AccountShard`,
  not Ewrap.
- `PendingMirState` / `DequeueMir` are consumed by Ewrap (clarified).
- `AppliedReward` and the `applied_rewards` field are populated during
  AccountShard, not Ewrap.
- RUPD's docstring now says rewards are consumed by `AccountShard`.
- Crash-recovery wording in `lib.rs` says "mid-boundary" instead of
  "mid-EWRAP" since the cursor specifically tracks AccountShard
  progress.

BREAKING CONFIG CHANGE: existing `dolos.toml` files that explicitly set
`ewrap_total_shards` need to rename the key to `ashard_total`. Users
relying on the default (omitted) are unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Aligns the type and variant names with the module path convention:

- struct `AccountShardWorkUnit` → `AShardWorkUnit`
- enum variant `CardanoWorkUnit::AccountShard` → `AShard`
- enum variant `InternalWorkUnit::AccountShard` → `AShard`
- WorkBuffer state `AccountShardingBoundary` → `AShardingBoundary`
- module re-export and all callers updated to match
- prose / docstrings / log messages also use `AShard` consistently

The module path is `crate::ashard`, so the type now reads as
`ashard::AShardWorkUnit`. No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per review feedback: the user-facing config name should be self-explanatory
in `dolos.toml`. Renames everywhere for consistency:

- `CardanoConfig::ashard_total` field → `account_shards`
- `CardanoConfig::ashard_total()` accessor → `account_shards()`
- `CardanoConfig::DEFAULT_ASHARD_TOTAL` → `DEFAULT_ACCOUNT_SHARDS`
- WorkBuffer parameters and error messages updated to match.

BREAKING CONFIG CHANGE: existing `dolos.toml` files that explicitly set
this option (under any prior name from this PR) need to use
`account_shards`. Users relying on the default are unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Guards against a config change to `account_shards` corrupting an
in-flight boundary. Previously, if dolos crashed mid-boundary and the
operator changed `account_shards` between crash and restart, the resume
would re-partition the account key space with the new count, mismatching
the cursor's already-committed shards.

Fix: snapshot the boundary's shard count into state at the first
`EpochEndAccumulate` apply. The persisted total is authoritative for the
duration of the in-flight boundary; the new config value only takes
effect on the next boundary.

Changes:
- New `AShardProgress { committed, total }` struct stored at
  `EpochState.ashard_progress: Option<AShardProgress>` (was
  `Option<u32>`).
- `EpochEndAccumulate` carries `total_shards`. Its apply validates the
  delta's `total_shards` matches any previously persisted total and
  surfaces an error if they diverge (would only happen if a work unit
  was constructed with a stale config view).
- `EpochWrapUp` and `EpochTransition` undo fields adapted to the new
  type.
- `AShardWorkUnit::load` / `commit_state` read the persisted total when
  present and fall back to `config.account_shards()` for fresh
  boundaries.
- `CardanoLogic` caches `effective_account_shards` (= persisted total
  if a boundary is in flight, else config). Refreshed at every
  `pop_work` call so `receive_block` (which has no state access) can
  use the up-to-date value when constructing
  `WorkBuffer::AShardingBoundary`.
- Crash-recovery wording updated to surface a clear warning when the
  persisted total disagrees with current config.
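The pinning rule above can be sketched as follows (hypothetical names; the real apply also advances `committed` and surfaces a typed error):

```rust
/// Durable per-boundary shard cursor, as described above.
#[derive(Debug, PartialEq, Clone, Copy)]
struct AShardProgress { committed: u32, total: u32 }

/// First accumulate snapshots the shard count into state; every later
/// delta must agree with the persisted total for this boundary.
fn pin_total(progress: &mut Option<AShardProgress>, delta_total: u32) -> Result<(), String> {
    match *progress {
        None => {
            // Fresh boundary: the config value becomes authoritative.
            *progress = Some(AShardProgress { committed: 0, total: delta_total });
            Ok(())
        }
        Some(p) if p.total == delta_total => Ok(()),
        Some(p) => Err(format!(
            "shard count changed mid-boundary: persisted {} vs delta {}",
            p.total, delta_total
        )),
    }
}
```

A config change between crash and restart therefore fails loudly instead of silently re-partitioning the key space under an in-flight cursor.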

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@scarmuega scarmuega marked this pull request as ready for review April 26, 2026 13:46

coderabbitai (Bot) left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
crates/cardano/src/ashard/rewards.rs (1)

114-188: ⚠️ Potential issue | 🟡 Minor

Log message clarity: add shard context to per-shard telemetry.

The flush method (lines 114–188) now runs once per AShard, but the telemetry messages ("rewards remaining before drain", "SPENDABLE REWARDS LEFT UNPROCESSED", etc.) don't indicate shard scope. Per-shard loading correctly isolates each shard's RewardMap via load_pending_rewards_range(state, Some(range)), so the drain and error detection are functionally sound. However, for operational visibility, these log statements should include shard index and total_shards to clarify they reflect per-shard metrics, not boundary-wide state. This is especially helpful for debugging when multiple shards report metrics in parallel.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/cardano/src/ashard/rewards.rs` around lines 114 - 188, The per-shard
logs in flush don't indicate which shard they belong to; update the flush
logging to include shard context (shard_index and total_shards). Add
shard_index: usize and total_shards: usize as fields on the AShard struct (or
otherwise make them available on self or ctx), ensure any AShard
construction/population passes these values, and then include %shard_index and
%total_shards in the tracing::debug! and tracing::error! invocations (and the
debug! at the end) around ctx.rewards/ drain_unspendable/
applied_reward_credentials so all per-shard telemetry shows shard scope.
crates/cardano/src/model/epochs.rs (1)

980-981: ⚠️ Potential issue | 🟡 Minor

Fix the unused_doc_comments rustdoc warning flagged by CI.

Pipeline reports rustdoc does not generate documentation for macro invocations on these lines. The /// comment immediately above the prop_compose! macro invocation is dropped on the floor by rustdoc and triggers the lint. Convert to a regular // comment (or move it inside the closure body if you want it preserved).

🛠️ Proposed fix
-    /// `EpochStatsUpdate::apply` calls `rolling.live_mut` which asserts `next` is None,
-    /// so we need a specialized generator that keeps `rolling.next` empty.
+    // `EpochStatsUpdate::apply` calls `rolling.live_mut` which asserts `next` is None,
+    // so we need a specialized generator that keeps `rolling.next` empty.
     prop_compose! {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/cardano/src/model/epochs.rs` around lines 980 - 981, The rustdoc
warning comes from using a /// doc comment immediately above the prop_compose!
macro (related to EpochStatsUpdate::apply and rolling.live_mut which asserts
rolling.next is None), so replace that leading /// comment with a non-doc
comment (//) or move the explanation into the generator body/closure so rustdoc
won't try to document the macro invocation; specifically update the comment
above the prop_compose! invocation that references rolling.live_mut and
rolling.next to use // (or relocate it inside the prop_compose! closure) to
silence the unused_doc_comments lint.
🧹 Nitpick comments (12)
crates/cardano/src/ashard/rewards.rs (1)

56-77: Stale EWRAP reference in comment.

The comment at lines 59–61 still says "registered at RUPD startStep time but deregistered before EWRAP". After this refactor the per-account registration check happens during AShard, not Ewrap. Suggest updating to "deregistered before AShard" (or "before the boundary's per-account leg") to keep the comment in lockstep with the renamed pipeline.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/cardano/src/ashard/rewards.rs` around lines 56 - 77, Update the stale
comment referencing "EWRAP" to reflect the new pipeline stage: change the text
around the account deregistration explanation to say "deregistered before
AShard" (or "before the boundary's per-account leg") so it matches where the
per-account registration check now runs; the relevant area to edit is the
comment above the account.is_registered() branch in the function that calls
reward.total_value() and ctx.rewards.return_reward_to_treasury(total). Ensure
the updated comment keeps the existing clarification about pre-filtered accounts
and returned_rewards behavior.
crates/cardano/src/ashard/loading.rs (1)

19-57: Optional: drop the Option<Range<EntityKey>> wrapper.

The doc-comment already states load_ashard is the only caller and always passes Some(range). Making the parameter Range<EntityKey> directly removes a never-taken branch and tightens the contract. If a future "load all" caller appears, reintroducing the option is trivial.

♻️ Suggested refactor
-    fn load_pending_rewards_range<D: Domain>(
-        &mut self,
-        state: &D::State,
-        range: Option<Range<EntityKey>>,
-    ) -> Result<(), ChainError> {
-        let pending_iter = state
-            .iter_entities_typed::<PendingRewardState>(PendingRewardState::NS, range)?;
+    fn load_pending_rewards_range<D: Domain>(
+        &mut self,
+        state: &D::State,
+        range: Range<EntityKey>,
+    ) -> Result<(), ChainError> {
+        let pending_iter = state
+            .iter_entities_typed::<PendingRewardState>(PendingRewardState::NS, Some(range))?;

And drop the Some(range.clone()) at the call site (line 81) to pass range.clone() directly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/cardano/src/ashard/loading.rs` around lines 19 - 57, The function
load_pending_rewards_range currently accepts Option<Range<EntityKey>> but the
only caller (load_ashard) always passes Some(range), so tighten the signature to
take Range<EntityKey> directly: change the parameter type in
load_pending_rewards_range and update callers (e.g., in load_ashard remove
Some(range.clone()) and pass range.clone() directly); ensure the body drops any
pattern-matching for None (no-op) and uses the range value when calling
state.iter_entities_typed::<PendingRewardState>(PendingRewardState::NS, range).
tests/epoch_pots/main.rs (1)

500-513: Optional: collapse the two if let Some(boundary) checks.

The shard-index==0 reset and the per-shard extend can share a single boundary binding, which trims a branch and makes the relationship between "capture ending epoch" and "extend rewards" obvious.

♻️ Suggested refactor
                 CardanoWorkUnit::AShard(shard) => {
-                    if shard.shard_index() == 0 {
-                        // First shard of this boundary — reset accumulator
-                        // and capture the ending epoch.
-                        accumulated_applied.clear();
-                        if let Some(boundary) = shard.boundary() {
-                            accumulated_ending_epoch =
-                                Some(boundary.ending_state().number);
-                        }
-                    }
-                    if let Some(boundary) = shard.boundary() {
+                    if let Some(boundary) = shard.boundary() {
+                        if shard.shard_index() == 0 {
+                            // First shard of this boundary — reset
+                            // accumulator and capture the ending epoch.
+                            accumulated_applied.clear();
+                            accumulated_ending_epoch =
+                                Some(boundary.ending_state().number);
+                        }
                         accumulated_applied.extend(boundary.applied_rewards.iter().cloned());
                     }
                 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/epoch_pots/main.rs` around lines 500 - 513, Collapse the two separate
`if let Some(boundary)` checks into a single binding for `boundary` when
matching `CardanoWorkUnit::AShard(shard)`: inside one `if let Some(boundary) =
shard.boundary()` block first check `if shard.shard_index() == 0` to clear
`accumulated_applied` and set `accumulated_ending_epoch =
Some(boundary.ending_state().number)`, then unconditionally call
`accumulated_applied.extend(boundary.applied_rewards.iter().cloned())`; this
removes the duplicated boundary lookup and makes the reset-and-extend logic
contiguous.
crates/cardano/work_units.md (1)

5-11: Add a language to the fenced block (MD040).

markdownlint-cli2 flagged this fence as missing a language. Use text (or any non-rendered identifier) so the diagram passes lint.

📝 Suggested fix
````diff
-```
+```text
 Estart  →  Roll …  →  Rupd  →  Roll …  →  AShard ×N     →  Ewrap
 (open)     (blocks)   (RUPD)   (blocks)    (per-account)    (global + close)
                                                                   │
                                                                   ▼
                                                           next epoch's Estart
````
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against the current code and only fix it if needed.

In @crates/cardano/work_units.md around lines 5 - 11, The fenced code block
containing the epoch diagram starting with "Estart → Roll … → Rupd → Roll
… → AShard ×N → Ewrap" is missing a language tag (MD040); fix it by adding a
language identifier (use "text") immediately after the opening triple backticks
so the block becomes ```text and the diagram passes markdownlint-cli2
validation.


</details>

</blockquote></details>
<details>
<summary>crates/cardano/src/lib.rs (2)</summary><blockquote>

`284-341`: **Consolidate duplicated `effective_account_shards` lookup logic.**

The inline computation at lines 338–341 is identical to the new helper `read_effective_account_shards` (lines 253–258). On top of that, `load_epoch::<D>(state)` is invoked twice within `initialize` (line 292 for the warnings, line 338 to derive `effective_account_shards`). A small refactor avoids the double read and keeps the “persisted total else config” rule in one place.



<details>
<summary>♻️ Proposed refactor</summary>

Lift the helper to a free fn and reuse it (and reuse the loaded `EpochState` for the warnings):

```diff
-    /// Compute the effective `account_shards` value: stored
-    /// `ashard_progress.total` if a boundary is in flight, otherwise the
-    /// configured value.
-    fn read_effective_account_shards<D: Domain>(&self, state: &D::State) -> u32 {
-        load_epoch::<D>(state)
-            .ok()
-            .and_then(|e| e.ashard_progress.as_ref().map(|p| p.total))
-            .unwrap_or_else(|| self.config.account_shards())
-    }
+    /// Compute the effective `account_shards` value: stored
+    /// `ashard_progress.total` if a boundary is in flight, otherwise the
+    /// configured value.
+    fn read_effective_account_shards<D: Domain>(&self, state: &D::State) -> u32 {
+        effective_account_shards_from_state::<D>(state, &self.config)
+    }
+}
+
+fn effective_account_shards_from_state<D: Domain>(state: &D::State, config: &CardanoConfig) -> u32 {
+    load_epoch::<D>(state)
+        .ok()
+        .and_then(|e| e.ashard_progress.as_ref().map(|p| p.total))
+        .unwrap_or_else(|| config.account_shards())
 }
```

And in `initialize`, compute once and reuse the loaded epoch for both the warnings and the cached value.
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@crates/cardano/src/lib.rs` around lines 284 - 341, Extract and reuse the
epoch load and the effective-account-shards logic: in initialize(), call
load_epoch::<D>(state) once and store the resulting EpochState (if Ok) for the
warnings; replace the inline computation of effective_account_shards with a call
to the existing helper read_effective_account_shards (or lift that helper to a
free fn if necessary) so the "persisted total else config" rule is implemented
in one place; update references to use the single loaded epoch value for the
tracing::warn! blocks and derive effective_account_shards from
read_effective_account_shards(state, &config) (or equivalent) to remove the
duplicate load_epoch::<D>(state) call.
```

</details>

---

`404-408`: **Per-`pop_work` state read may be unnecessary outside boundaries.**

`effective_account_shards` is refreshed on every `pop_work` call (each block cycle) via a `read_entity_typed` lookup, but it only matters at epoch boundaries. The cost is small (single point read), but you could skip the read when the cached value is non-stale by only refreshing when `ashard_progress` could have changed (i.e., before an actual boundary handoff). Worth considering if `pop_work` shows up in profiling.
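The "refresh only when the marker changes" idea can be sketched as below. This is a standalone model, not the crate's code: `Logic`, `pop_work`, and the `persisted_total` parameter stand in for `CardanoLogic`, the real `pop_work`, and a read of `EpochState.ashard_progress.total`.

```rust
// Hypothetical sketch: cache the last-seen progress marker and only
// re-derive effective_account_shards when the marker changes.
struct Logic {
    effective_account_shards: u32,
    last_ashard_total: Option<u32>,
    config_shards: u32,
}

impl Logic {
    fn pop_work(&mut self, persisted_total: Option<u32>) {
        // Skip the re-derivation on the common path where no boundary
        // is in flight and the marker is unchanged.
        if persisted_total != self.last_ashard_total {
            self.effective_account_shards =
                persisted_total.unwrap_or(self.config_shards);
            self.last_ashard_total = persisted_total;
        }
    }
}

fn main() {
    let mut logic = Logic {
        effective_account_shards: 8,
        last_ashard_total: None,
        config_shards: 8,
    };
    logic.pop_work(None); // steady state: no boundary in flight
    assert_eq!(logic.effective_account_shards, 8);
    logic.pop_work(Some(16)); // boundary in flight, persisted total wins
    assert_eq!(logic.effective_account_shards, 16);
    logic.pop_work(Some(16)); // unchanged marker: re-derive skipped
    assert_eq!(logic.effective_account_shards, 16);
    println!("ok");
}
```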

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@crates/cardano/src/lib.rs` around lines 404 - 408, The code refreshes
self.effective_account_shards on every pop_work by calling
read_effective_account_shards::<D>(domain.state()), but this value only matters
when the epoch boundary handoff (EpochState.ashard_progress.total) can change;
avoid the repeated read by first checking whether ashard_progress has changed
and only calling read_effective_account_shards when it has. Add a small cached
marker on the struct (e.g., last_ashard_total or last_ashard_progress) and in
pop_work compare domain.state().ashard_progress.total (or the appropriate
accessor) to that cached value; if different, call read_effective_account_shards
and update both effective_account_shards and the cached marker, otherwise skip
the read and keep the existing effective_account_shards.
```

</details>

</blockquote></details>
<details>
<summary>crates/cardano/src/ewrap/commit.rs (1)</summary><blockquote>

`91-100`: **Full-table account scan to apply a handful of MIR rewards.**

`stream_and_apply_namespace::<D, AccountState>(state, &writer, None)` walks every account in state to surface the small set with MIR-queued deltas. The inline comment acknowledges this is "effectively a targeted write via the streaming path", but on mainnet-sized account tables this is a significant cost paid every boundary, regardless of how few MIRs are pending (often zero).

Consider a direct path: iterate `self.applied_mir_credentials` (or the credentials of the queued `AssignRewards` deltas), point-read each `AccountState`, apply, and write. The streaming helper can stay for the full-set namespaces (Pool/DRep/Proposal/EpochState), but accounts during Ewrap are inherently sparse.
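The access pattern being suggested can be modelled in isolation. Everything below is an illustrative stand-in (a `HashMap` for the state store, `Credential`/`AccountState` as toy types); the point is the shape: point-read only the accounts with queued deltas rather than streaming the whole namespace.

```rust
use std::collections::HashMap;

type Credential = String;

#[derive(Debug, Default, Clone, PartialEq)]
struct AccountState {
    rewards: u64,
}

/// Apply queued MIR deltas via targeted point-reads. Returns the number
/// of accounts actually touched — the rest of the table is never read.
fn apply_mir_targeted(
    state: &mut HashMap<Credential, AccountState>,
    queued: &[(Credential, u64)],
) -> usize {
    let mut touched = 0;
    for (cred, delta) in queued {
        if let Some(account) = state.get_mut(cred) {
            account.rewards += delta;
            touched += 1;
        }
    }
    touched
}

fn main() {
    // A "mainnet-sized" table of accounts, of which only two have MIRs.
    let mut state: HashMap<Credential, AccountState> = (0..1_000)
        .map(|i| (format!("cred{i}"), AccountState::default()))
        .collect();
    let queued = vec![("cred7".to_string(), 500), ("cred42".to_string(), 250)];

    assert_eq!(apply_mir_targeted(&mut state, &queued), 2);
    assert_eq!(state["cred7"].rewards, 500);
    assert_eq!(state["cred42"].rewards, 250);
    println!("ok");
}
```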

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@crates/cardano/src/ewrap/commit.rs` around lines 91 - 100, The current call
to stream_and_apply_namespace::<D, AccountState>(state, &writer, None) performs
a full-table scan of AccountState each boundary; instead implement a targeted
path that iterates over self.applied_mir_credentials (or the credentials from
queued AssignRewards deltas), point-reads each AccountState, applies the MIR
delta, and writes the result via the same writer so only affected accounts are
touched; keep stream_and_apply_namespace for full-set namespaces like
Pool/DRep/Proposal/EpochState, and ensure the new code reuses the existing
apply/serialize logic used by stream_and_apply_namespace to maintain invariants
and error handling.
```

</details>

</blockquote></details>
<details>
<summary>crates/cardano/src/ashard/work_unit.rs (1)</summary><blockquote>

`62-129`: **Avoid recomputing `total_shards`/`range` from state in `commit_state`.**

`load` computes `total_shards` and `range` from persisted state, then calls `BoundaryWork::load_ashard(..., shard_index, total_shards, range)` — so the boundary already owns these values. `commit_state` repeats the same state read and recomputation, with two minor concerns:

1. **Code duplication**: identical 8-line block in both methods (and a third copy in `crates/cardano/src/lib.rs`).
2. **Subtle drift risk**: if `commit_state` ever observed a different `ashard_progress.total` than `load` did (e.g., a shard's apply landing between the two phases of the same work unit), the recomputed `range` would diverge from what was loaded. Today this can't happen, but cross-phase reuse of the captured values is more obviously correct.

Consider exposing the captured shard info from `BoundaryWork` (or storing the computed `range` in `AShardWorkUnit` after `load`) and have `commit_state` reuse it instead of re-deriving.



<details>
<summary>♻️ Sketch</summary>

```diff
 pub struct AShardWorkUnit {
     slot: BlockSlot,
     config: CardanoConfig,
     genesis: Arc<Genesis>,
     shard_index: u32,
-    boundary: Option<BoundaryWork>,
+    boundary: Option<BoundaryWork>,
+    /// Captured during `load` so `commit_state` reuses the exact same range.
+    range: Option<std::ops::Range<dolos_core::EntityKey>>,
 }
```

Then `load` stores `self.range = Some(range.clone())` and `commit_state` consumes it.
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@crates/cardano/src/ashard/work_unit.rs` around lines 62 - 129, The
load/commit_state duplication and drift risk can be fixed by capturing the
computed total_shards and range during load and reusing them in commit_state
instead of re-reading state: after computing total_shards and range in load
(before calling BoundaryWork::load_ashard), store the values on the work unit
(e.g., add fields self.total_shards and/or self.range and assign them there),
and then change commit_state to take the boundary via self.boundary.as_mut() and
use the stored range/total_shards when calling boundary.commit_ashard::<D>(...)
rather than recomputing via load_epoch; update BoundaryWork usage only to
consume the already-stored range if needed.
```

</details>

</blockquote></details>
<details>
<summary>crates/cardano/src/work.rs (2)</summary><blockquote>

`276-279`: **Test `TEST_TOTAL_SHARDS = 1` doesn't cover multi-shard sequencing in this state machine.**

These tests collapse `AShard` into the `EWrap` tag and set `TEST_TOTAL_SHARDS = 1`, so the multi-shard loop in `pop_work` (lines 226–245) — including the `next_index >= total_shards` transition — is exercised only with `total_shards = 1`. Worth adding at least one test with `TEST_TOTAL_SHARDS > 1` that asserts the work buffer emits exactly N `AShard` units before transitioning to `EwrapBoundary` (and that `last_point_seen` / cursor invariants hold across that transition). This catches off-by-one and ordering regressions in the boundary state machine that the current tests would silently miss.



Want me to draft this test?


Also applies to: 325-330
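The sequencing such a test would pin down can be sketched with a minimal model. The `Unit` enum and `Boundary` driver below are stand-ins for the crate's work-unit tags and `pop_work` state machine, not the real types; a real test would drive the existing `WorkBuffer` instead.

```rust
#[derive(Debug, PartialEq)]
enum Unit {
    AShard(u32),
    EwrapBoundary,
}

/// Toy model of the multi-shard loop: emit one AShard per index until
/// next_index reaches total_shards, then hand off to the global leg.
struct Boundary {
    next_index: u32,
    total_shards: u32,
}

impl Boundary {
    fn pop_work(&mut self) -> Unit {
        if self.next_index >= self.total_shards {
            return Unit::EwrapBoundary;
        }
        let unit = Unit::AShard(self.next_index);
        self.next_index += 1;
        unit
    }
}

fn main() {
    const TEST_TOTAL_SHARDS: u32 = 4; // > 1, unlike the current tests
    let mut machine = Boundary {
        next_index: 0,
        total_shards: TEST_TOTAL_SHARDS,
    };
    // Exactly N consecutive AShard units, in index order...
    for expected in 0..TEST_TOTAL_SHARDS {
        assert_eq!(machine.pop_work(), Unit::AShard(expected));
    }
    // ...and only then the transition to the global Ewrap leg.
    assert_eq!(machine.pop_work(), Unit::EwrapBoundary);
    println!("ok");
}
```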

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@crates/cardano/src/work.rs` around lines 276 - 279, Add a new unit test that
sets TEST_TOTAL_SHARDS > 1 (instead of the current const TEST_TOTAL_SHARDS = 1)
to exercise the multi-shard sequencing in the pop_work logic: drive the
WorkBuffer/state machine (invoking pop_work) and assert that it emits exactly N
consecutive AShard units (N == TEST_TOTAL_SHARDS) before producing an
EwrapBoundary (or EWrap/Ewrap as used in this code), and verify that
last_point_seen and any cursor invariants are preserved across the transition;
target the test to exercise the pop_work path that contains the next_index >=
total_shards branch and use the existing helpers/setup used by the other tests
so it integrates with the same state machine and coverage.
```

</details>

---

`120-143`: **`account_shards == 0` "skip-shards" fallback contradicts `validate_total_shards`.**

`crates/cardano/src/ashard/shard.rs::validate_total_shards` rejects `0` outright ("account_shards must be >= 1"), yet both `on_ewrap_boundary` and the `PreEwrapBoundary` arm of `pop_work` treat `0` as "no shards — go straight to Ewrap". If a config bug or missing validation ever hands `0` here, you'll silently skip per-account boundary processing instead of failing fast. The defensive comment ("Shouldn't happen with a valid config") tacitly acknowledges this.

Either drop the branches and let an `unreachable!()` / `assert!(account_shards >= 1)` fail loudly, or change `validate_total_shards` to accept `0` as "no shards" if that's the intended semantics. Right now the two pieces disagree.




Also applies to: 210-225
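The fail-fast option looks roughly like the sketch below. The names mirror the discussion (`validate_total_shards`, `on_ewrap_boundary`) but the bodies are illustrative stand-ins, not the crate's code.

```rust
fn validate_total_shards(total: u32) -> Result<(), String> {
    if total == 0 {
        Err("account_shards must be >= 1".into())
    } else {
        Ok(())
    }
}

#[derive(Debug, PartialEq)]
enum Next {
    AShard { index: u32 },
    EwrapBoundary,
}

fn on_ewrap_boundary(account_shards: u32) -> Next {
    // Fail loudly instead of treating 0 as "no shards — skip to Ewrap",
    // which would contradict validate_total_shards.
    assert!(
        validate_total_shards(account_shards).is_ok(),
        "account_shards must be >= 1, got {account_shards}"
    );
    Next::AShard { index: 0 }
}

fn main() {
    assert_eq!(on_ewrap_boundary(4), Next::AShard { index: 0 });
    // An invalid config now aborts the boundary instead of silently
    // skipping per-account processing.
    let zero = std::panic::catch_unwind(|| on_ewrap_boundary(0));
    assert!(zero.is_err());
    println!("ok");
}
```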

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@crates/cardano/src/work.rs` around lines 120 - 143, The code currently treats
account_shards == 0 as a valid "no shards" path in on_ewrap_boundary (and
similarly in the PreEwrapBoundary arm of pop_work), which contradicts
validate_total_shards that requires account_shards >= 1; change the behavior to
fail-fast: remove the special-case branch that returns EwrapBoundary when
account_shards == 0 and instead assert or panic if account_shards < 1 (e.g., add
assert!(account_shards >= 1) at the start of on_ewrap_boundary), keeping the
normal Restart -> AShardingBoundary path for valid shard counts; alternatively,
if the intended semantics are that 0 means "no shards", update
validate_total_shards to accept 0 and document that invariant—pick one
consistent approach and apply the same fix in the PreEwrapBoundary/pop_work code
paths so both agree with validate_total_shards.
```

</details>

</blockquote></details>
<details>
<summary>crates/cardano/src/ashard/shard.rs (1)</summary><blockquote>

`37-60`: **`debug_assert!` makes invalid `total_shards` a runtime div-by-zero in release.**

The contract is documented and there's a `validate_total_shards` helper, but the only enforcement inside `shard_key_range` is `debug_assert!`. In release builds, `total_shards == 0` reaches `let step = PREFIX_SPACE / total_shards;` and panics with a less informative message; non-divisor values silently produce truncated `step` arithmetic.

Since this is a public function and the cost of a real assertion (or returning `Result`) is negligible compared to the I/O the caller does, prefer `assert!` (or propagate the error from `validate_total_shards`).
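For reference, the prefix-space arithmetic at stake can be modelled standalone. The constants and signatures below are assumptions for illustration (a 256-value one-byte prefix space, ranges returned as `(u32, u32)` pairs), not the crate's actual `shard_key_range` definition.

```rust
const PREFIX_SPACE: u32 = 256;

fn validate_total_shards(total: u32) -> Result<(), String> {
    if total == 0 {
        return Err("account_shards must be >= 1".into());
    }
    if PREFIX_SPACE % total != 0 {
        return Err(format!("account_shards must divide {PREFIX_SPACE}"));
    }
    Ok(())
}

/// Half-open prefix range [start, end) for one shard, with real
/// assertions so release builds fail loudly instead of dividing by
/// zero or silently truncating `step`.
fn shard_prefix_range(shard_index: u32, total_shards: u32) -> (u32, u32) {
    assert!(
        validate_total_shards(total_shards).is_ok(),
        "invalid total_shards: {total_shards}"
    );
    assert!(shard_index < total_shards);
    let step = PREFIX_SPACE / total_shards;
    (shard_index * step, (shard_index + 1) * step)
}

fn main() {
    // 4 shards split the 256 prefixes into equal, contiguous ranges.
    assert_eq!(shard_prefix_range(0, 4), (0, 64));
    assert_eq!(shard_prefix_range(3, 4), (192, 256));
    // Non-divisors are rejected up front rather than truncating `step`.
    assert!(validate_total_shards(3).is_err());
    assert!(validate_total_shards(0).is_err());
    println!("ok");
}
```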



<details>
<summary>🛡️ Proposed change</summary>

```diff
 pub fn shard_key_range(shard_index: u32, total_shards: u32) -> Range<EntityKey> {
-    debug_assert!(validate_total_shards(total_shards).is_ok());
-    debug_assert!(shard_index < total_shards);
+    assert!(
+        validate_total_shards(total_shards).is_ok(),
+        "invalid total_shards: {total_shards}"
+    );
+    assert!(shard_index < total_shards, "shard_index {shard_index} >= total_shards {total_shards}");
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@crates/cardano/src/ashard/shard.rs` around lines 37 - 60, The function
shard_key_range uses debug_assert! to check validate_total_shards(total_shards)
and shard_index < total_shards, which leaves a division-by-zero and silent
truncation in release; change the debug_assert! checks to real runtime checks
(e.g., assert!(validate_total_shards(total_shards).is_ok()) and
assert!(shard_index < total_shards)) or convert the function to return Result
and propagate validate_total_shards() error from shard_key_range so PREFIX_SPACE
/ total_shards cannot divide by zero and invalid total_shards are handled
deterministically; update callers if you choose the Result approach and keep
references to shard_key_range, validate_total_shards, and PREFIX_SPACE when
making the change.
```

</details>

</blockquote></details>
<details>
<summary>crates/cardano/src/ewrap/loading.rs (1)</summary><blockquote>

`280-285`: **`shard_applied_*` fields are dead state on Ewrap-constructed `BoundaryWork`.**

`new_empty` zero-initializes `shard_applied_effective`, `shard_applied_unspendable_to_treasury`, and `shard_applied_unspendable_to_reserves`, but Ewrap never reads/writes them — the per-shard accumulators land in `EpochState.end` via `EpochEndAccumulate` and are read from `ending_state.end` by `wrapup.flush`. The fields exist only because `BoundaryWork` is shared with the AShard path.

Not a bug, but the dual-use API obscures intent. If a future change adds a third caller, it's easy to misuse these fields. Consider either splitting `BoundaryWork` into `AShardWork`/`EwrapWork` or documenting on the struct which fields each phase owns.
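One shape the split could take is a shared core plus phase-specific wrappers, so the type system encodes which phase owns the accumulators. The struct and field names below are illustrative stand-ins patterned on `BoundaryWork` and its `shard_applied_*` fields, not the crate's actual layout.

```rust
#[derive(Default)]
struct BoundaryCore {
    ending_epoch: u64,
}

/// Per-account (AShard) leg: owns the shard accumulators.
#[allow(dead_code)]
#[derive(Default)]
struct AShardWork {
    core: BoundaryCore,
    shard_applied_effective: u64,
    shard_applied_unspendable_to_treasury: u64,
    shard_applied_unspendable_to_reserves: u64,
}

/// Global-closure (Ewrap) leg: no per-shard accumulators to misuse —
/// those totals live in EpochState.end and are read by wrap-up.
#[derive(Default)]
struct EwrapWork {
    core: BoundaryCore,
}

fn main() {
    let shard = AShardWork::default();
    let ewrap = EwrapWork::default();
    // A future third caller can no longer reach for dead fields: the
    // Ewrap type simply doesn't have them.
    assert_eq!(shard.shard_applied_effective, 0);
    assert_eq!(ewrap.core.ending_epoch, 0);
    println!("ok");
}
```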

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@crates/cardano/src/ewrap/loading.rs` around lines 280 - 285, BoundaryWork's
shard_applied_effective / shard_applied_unspendable_to_treasury /
shard_applied_unspendable_to_reserves are dead for the Ewrap path (zeroed in
new_empty and never read by Ewrap; real per-shard accumulators live in
EpochState.end via EpochEndAccumulate and are consumed by wrapup.flush), so fix
by either splitting the dual-purpose struct into two clear types (e.g.,
AShardWork and EwrapWork) and move the AShard-only shard_applied_* fields into
AShardWork, or add explicit documentation/comments on BoundaryWork and those
three fields noting they are owned/used only by the AShard path; update
new_empty, any constructors, and usages of BoundaryWork (including references in
EpochEndAccumulate and wrapup.flush) to use the new types or to reflect the
documented ownership to prevent future misuse.
```

</details>

</blockquote></details>

</blockquote></details>

<details>
<summary>🤖 Prompt for all review comments with AI agents</summary>

Verify each finding against the current code and only fix it if needed.

Inline comments:
In @crates/cardano/src/model/epochs.rs:

  • Around line 737-740: The apply path currently does
    entity.end.as_mut().expect("ESTART seeded EpochState.end before shards run")
    which panics if EpochState.end is None; instead ensure end is initialized to
    EndStats::default() when missing (so subsequent EpochWrapUp can overwrite it).
    Update the apply code in crates/cardano/src/model/epochs.rs to replace the
    expect with a safe initialization (e.g., use get_or_insert_with / get_or_insert
    or similar) so entity.end becomes Some(EndStats::default()) if it was None
    before mutating it.
  • Around line 684-771: EpochEndAccumulate::undo currently assumes apply always
    mutated state, causing underflow/corruption when apply returned early; fix by
    recording in apply whether it performed mutations and the prior values needed to
    revert (capture previous end.effective_rewards, end.unspendable_to_treasury,
    end.unspendable_to_reserves and previous entity.ashard_progress/AShardProgress)
    into fields on EpochEndAccumulate (or a was_applied flag plus prev_* fields),
    then make undo a no-op if apply was skipped or restore those captured prev_*
    values and previous ashard_progress instead of unconditionally subtracting and
    setting committed/total; update apply (function EpochEndAccumulate::apply) to
    populate these fields only when it actually mutates state and add a proptest
    exercising apply→undo when ashard_progress is already advanced past
    completed_shard_index to lock the behavior.

In @crates/core/src/config.rs:

  • Around line 988-1002: Add runtime validation to reject invalid account_shards
    values by calling the existing validate_total_shards() from
    crates/cardano/src/ashard/shard.rs during config initialization: when
    constructing/using CardanoConfig (e.g., inside CardanoLogic::initialize() before
    any use of CardanoConfig::account_shards() or DEFAULT_ACCOUNT_SHARDS), call
    validate_total_shards(self.account_shards.unwrap_or(Self::DEFAULT_ACCOUNT_SHARDS))
    and return/propagate an error if it fails; ensure the initialization path stops
    the startup with a clear error rather than allowing invalid values (0 or
    non-divisors of 256) to be used.

In @skills/debug-epoch-mismatch/SKILL.md:

  • Line 138: The documentation still treats reward application as a single EWRAP
    work unit but the code split moves per-account reward application into AShard
    while Ewrap now handles MIR/global closure; update the tables that list or
    reference EWRAP (notably the "Classify the Mismatch", "Dolos Work Units →
    Haskell Concepts", and the instrumentation hypothesis table) to distinguish
    AShard vs Ewrap: point applyRUpd + reward filtering, unspendable routing and
    registration filtering to AShard and point MIR and boundary close to Ewrap, and
    adjust any bisection/debugging guidance so it routes to AShard for per-account
    reward issues and to Ewrap for MIR/global closure issues. Ensure references
    include the file/class names AShard, Ewrap and the function/concept names
    applyRUpd, MIR, reward filtering, boundary close to make locating code
    straightforward.

Outside diff comments:
In @crates/cardano/src/ashard/rewards.rs:

  • Around line 114-188: The per-shard logs in flush don't indicate which shard
    they belong to; update the flush logging to include shard context (shard_index
    and total_shards). Add shard_index: usize and total_shards: usize as fields on
    the AShard struct (or otherwise make them available on self or ctx), ensure any
    AShard construction/population passes these values, and then include
    %shard_index and %total_shards in the tracing::debug! and tracing::error!
    invocations (and the debug! at the end) around ctx.rewards/ drain_unspendable/
    applied_reward_credentials so all per-shard telemetry shows shard scope.

In @crates/cardano/src/model/epochs.rs:

  • Around line 980-981: The rustdoc warning comes from using a /// doc comment
    immediately above the prop_compose! macro (related to EpochStatsUpdate::apply
    and rolling.live_mut which asserts rolling.next is None), so replace that
    leading /// comment with a non-doc comment (//) or move the explanation into the
    generator body/closure so rustdoc won't try to document the macro invocation;
    specifically update the comment above the prop_compose! invocation that
    references rolling.live_mut and rolling.next to use // (or relocate it inside
    the prop_compose! closure) to silence the unused_doc_comments lint.

Nitpick comments:
In @crates/cardano/src/ashard/loading.rs:

  • Around line 19-57: The function load_pending_rewards_range currently accepts
    Option<Range<EntityKey>> but the only caller (load_ashard) always passes
    Some(range), so tighten the signature to take Range<EntityKey> directly:
    change the parameter type in load_pending_rewards_range and update callers
    (e.g., in load_ashard remove Some(range.clone()) and pass range.clone()
    directly); ensure the body drops any pattern-matching for None (no-op) and
    uses the range value when calling
    state.iter_entities_typed::<PendingRewardState>(PendingRewardState::NS, range).

In @crates/cardano/src/ashard/rewards.rs:

  • Around line 56-77: Update the stale comment referencing "EWRAP" to reflect the
    new pipeline stage: change the text around the account deregistration
    explanation to say "deregistered before AShard" (or "before the boundary's
    per-account leg") so it matches where the per-account registration check now
    runs; the relevant area to edit is the comment above the account.is_registered()
    branch in the function that calls reward.total_value() and
    ctx.rewards.return_reward_to_treasury(total). Ensure the updated comment keeps
    the existing clarification about pre-filtered accounts and returned_rewards
    behavior.

In @crates/cardano/src/ashard/shard.rs:

  • Around line 37-60: The function shard_key_range uses debug_assert! to check
    validate_total_shards(total_shards) and shard_index < total_shards, which leaves
    a division-by-zero and silent truncation in release; change the debug_assert!
    checks to real runtime checks (e.g.,
    assert!(validate_total_shards(total_shards).is_ok()) and assert!(shard_index <
    total_shards)) or convert the function to return Result and propagate
    validate_total_shards() error from shard_key_range so PREFIX_SPACE /
    total_shards cannot divide by zero and invalid total_shards are handled
    deterministically; update callers if you choose the Result approach and keep
    references to shard_key_range, validate_total_shards, and PREFIX_SPACE when
    making the change.

In @crates/cardano/src/ashard/work_unit.rs:

  • Around line 62-129: The load/commit_state duplication and drift risk can be
    fixed by capturing the computed total_shards and range during load and reusing
    them in commit_state instead of re-reading state: after computing total_shards
    and range in load (before calling BoundaryWork::load_ashard), store the values
    on the work unit (e.g., add fields self.total_shards and/or self.range and
    assign them there), and then change commit_state to take the boundary via
    self.boundary.as_mut() and use the stored range/total_shards when calling
    boundary.commit_ashard::<D>(...) rather than recomputing via load_epoch; update
    BoundaryWork usage only to consume the already-stored range if needed.

In @crates/cardano/src/ewrap/commit.rs:

  • Around line 91-100: The current call to stream_and_apply_namespace::<D,
    AccountState>(state, &writer, None) performs a full-table scan of AccountState
    each boundary; instead implement a targeted path that iterates over
    self.applied_mir_credentials (or the credentials from queued AssignRewards
    deltas), point-reads each AccountState, applies the MIR delta, and writes the
    result via the same writer so only affected accounts are touched; keep
    stream_and_apply_namespace for full-set namespaces like
    Pool/DRep/Proposal/EpochState, and ensure the new code reuses the existing
    apply/serialize logic used by stream_and_apply_namespace to maintain invariants
    and error handling.

In @crates/cardano/src/ewrap/loading.rs:

  • Around line 280-285: BoundaryWork's shard_applied_effective /
    shard_applied_unspendable_to_treasury / shard_applied_unspendable_to_reserves
    are dead for the Ewrap path (zeroed in new_empty and never read by Ewrap; real
    per-shard accumulators live in EpochState.end via EpochEndAccumulate and are
    consumed by wrapup.flush), so fix by either splitting the dual-purpose struct
    into two clear types (e.g., AShardWork and EwrapWork) and move the AShard-only
    shard_applied_* fields into AShardWork, or add explicit documentation/comments
    on BoundaryWork and those three fields noting they are owned/used only by the
    AShard path; update new_empty, any constructors, and usages of BoundaryWork
    (including references in EpochEndAccumulate and wrapup.flush) to use the new
    types or to reflect the documented ownership to prevent future misuse.

In @crates/cardano/src/lib.rs:

  • Around line 284-341: Extract and reuse the epoch load and the
    effective-account-shards logic: in initialize(), call load_epoch::<D>(state)
    once and store the resulting EpochState (if Ok) for the warnings; replace the
    inline computation of effective_account_shards with a call to the existing
    helper read_effective_account_shards (or lift that helper to a free fn if
    necessary) so the "persisted total else config" rule is implemented in one
    place; update references to use the single loaded epoch value for the
    tracing::warn! blocks and derive effective_account_shards from
    read_effective_account_shards(state, &config) (or equivalent) to remove the
    duplicate load_epoch::(state) call.
  • Around line 404-408: The code refreshes self.effective_account_shards on every
    pop_work by calling read_effective_account_shards::<D>(domain.state()), but this
    value only matters when the epoch boundary handoff
    (EpochState.ashard_progress.total) can change; avoid the repeated read by first
    checking whether ashard_progress has changed and only calling
    read_effective_account_shards when it has. Add a small cached marker on the
    struct (e.g., last_ashard_total or last_ashard_progress) and in pop_work compare
    domain.state().ashard_progress.total (or the appropriate accessor) to that
    cached value; if different, call read_effective_account_shards and update both
    effective_account_shards and the cached marker, otherwise skip the read and keep
    the existing effective_account_shards.

In @crates/cardano/src/work.rs:

  • Around line 276-279: Add a new unit test that sets TEST_TOTAL_SHARDS > 1
    (instead of the current const TEST_TOTAL_SHARDS = 1) to exercise the multi-shard
    sequencing in the pop_work logic: drive the WorkBuffer/state machine (invoking
    pop_work) and assert that it emits exactly N consecutive AShard units (N ==
    TEST_TOTAL_SHARDS) before producing an EwrapBoundary (or EWrap/Ewrap as used in
    this code), and verify that last_point_seen and any cursor invariants are
    preserved across the transition; target the test to exercise the pop_work path
    that contains the next_index >= total_shards branch and use the existing
    helpers/setup used by the other tests so it integrates with the same state
    machine and coverage.
  • Around line 120-143: The code currently treats account_shards == 0 as a valid
    "no shards" path in on_ewrap_boundary (and similarly in the PreEwrapBoundary arm
    of pop_work), which contradicts validate_total_shards that requires
    account_shards >= 1; change the behavior to fail-fast: remove the special-case
    branch that returns EwrapBoundary when account_shards == 0 and instead assert or
    panic if account_shards < 1 (e.g., add assert!(account_shards >= 1) at the start
    of on_ewrap_boundary), keeping the normal Restart -> AShardingBoundary path for
    valid shard counts; alternatively, if the intended semantics are that 0 means
    "no shards", update validate_total_shards to accept 0 and document that
    invariant—pick one consistent approach and apply the same fix in the
    PreEwrapBoundary/pop_work code paths so both agree with validate_total_shards.

In @crates/cardano/work_units.md:

  • Around line 5-11: The fenced code block containing the epoch diagram starting
    with "Estart → Roll … → Rupd → Roll … → AShard ×N → Ewrap" is missing
    a language tag (MD040); fix it by adding a language identifier (use "text")
    immediately after the opening triple backticks so the block becomes ```text and
    the diagram passes markdownlint-cli2 validation.

In @tests/epoch_pots/main.rs:

  • Around line 500-513: Collapse the two separate if let Some(boundary) checks
    into a single binding for boundary when matching
    CardanoWorkUnit::AShard(shard): inside one if let Some(boundary) = shard.boundary() block first check if shard.shard_index() == 0 to clear
    accumulated_applied and set accumulated_ending_epoch = Some(boundary.ending_state().number), then unconditionally call
    accumulated_applied.extend(boundary.applied_rewards.iter().cloned()); this
    removes the duplicated boundary lookup and makes the reset-and-extend logic
    contiguous.

</details>

<details>
<summary>🪄 Autofix (Beta)</summary>

Fix all unresolved CodeRabbit comments on this PR:

- [ ] Push a commit to this branch (recommended)
- [ ] Create a new PR with the fixes

</details>

---

<details>
<summary>ℹ️ Review info</summary>

<details>
<summary>⚙️ Run configuration</summary>

**Configuration used**: defaults

**Review profile**: CHILL

**Plan**: Pro

**Run ID**: `13b48d06-fb3d-459b-b0a0-59ac2b808397`

</details>

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between 885094deed721abaa8bca14087703bd7ea409c28 and 8174d2a832bd81baa5228d8b3059b0b2e0bfd7ba.

</details>

<details>
<summary>📒 Files selected for processing (23)</summary>

* `crates/cardano/src/ashard/commit.rs`
* `crates/cardano/src/ashard/loading.rs`
* `crates/cardano/src/ashard/mod.rs`
* `crates/cardano/src/ashard/rewards.rs`
* `crates/cardano/src/ashard/shard.rs`
* `crates/cardano/src/ashard/work_unit.rs`
* `crates/cardano/src/ewrap/commit.rs`
* `crates/cardano/src/ewrap/loading.rs`
* `crates/cardano/src/ewrap/mod.rs`
* `crates/cardano/src/ewrap/work_unit.rs`
* `crates/cardano/src/ewrap/wrapup.rs`
* `crates/cardano/src/genesis/mod.rs`
* `crates/cardano/src/lib.rs`
* `crates/cardano/src/model/epochs.rs`
* `crates/cardano/src/model/mod.rs`
* `crates/cardano/src/model/pending.rs`
* `crates/cardano/src/rupd/work_unit.rs`
* `crates/cardano/src/work.rs`
* `crates/cardano/work_units.md`
* `crates/core/src/config.rs`
* `skills/debug-epoch-mismatch/SKILL.md`
* `tests/epoch_pots/main.rs`
* `tests/memory.rs`

</details>

</details>


Comment thread crates/cardano/src/model/epochs.rs
Comment thread crates/cardano/src/model/epochs.rs
Comment thread crates/core/src/config.rs Outdated
Comment thread skills/debug-epoch-mismatch/SKILL.md Outdated
scarmuega and others added 3 commits April 26, 2026 12:29
apply has three early-return guards (idempotent repeat, out-of-order,
total_shards mismatch) that leave state untouched. undo unconditionally
subtracted the deltas and overwrote ashard_progress, so a rollback
following a skipped apply would underflow the u64 end.* fields and
clobber the cursor.

Capture prev_ashard_progress and set an applied flag during apply only
when state is actually mutated; undo now early-returns when !applied and
otherwise restores from the snapshot. Same pattern as EpochWrapUp/EpochStatsUpdate.

Also broaden any_epoch_state to vary ashard_progress so the existing
roundtrip proptests for EpochWrapUp and EpochTransition exercise the
Some(_) → None → Some(_) path their apply/undo introduced earlier in
this branch (previously only None → None was covered). Add a dedicated
epoch_end_accumulate_roundtrip proptest covering all four progress
shapes and all three skip branches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
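The snapshot-and-flag pattern this commit describes can be sketched in miniature (all type and field names here are illustrative stand-ins, not the crate's actual `EWrapProgress` machinery):

```rust
#[derive(Clone, Default, PartialEq, Debug)]
struct Progress {
    committed: u32,
    total: u32,
}

#[derive(Default)]
struct State {
    end_total: u64,
    progress: Option<Progress>,
}

#[derive(Default)]
struct Accumulate {
    shard: u32,
    total: u32,
    delta: u64,
    applied: bool,                   // set only when apply actually mutates state
    prev_progress: Option<Progress>, // snapshot taken just before mutation
}

impl Accumulate {
    fn apply(&mut self, st: &mut State) {
        // three early-return guards (idempotent repeat, out-of-order,
        // total mismatch) leave both the state and `applied` untouched
        match &st.progress {
            Some(p) if p.total != self.total => return,     // total mismatch
            Some(p) if self.shard != p.committed => return, // repeat / out of order
            None if self.shard != 0 => return,
            _ => {}
        }
        self.prev_progress = st.progress.clone(); // snapshot first
        st.end_total += self.delta;
        st.progress = Some(Progress { committed: self.shard + 1, total: self.total });
        self.applied = true;
    }

    fn undo(&mut self, st: &mut State) {
        // a rollback after a skipped apply must be a no-op; otherwise the
        // u64 counters would underflow and the cursor would be clobbered
        if !self.applied {
            return;
        }
        st.end_total -= self.delta;
        st.progress = self.prev_progress.clone();
        self.applied = false;
    }
}

fn main() {
    let mut st = State::default();

    // out-of-order shard: apply is skipped, so undo must not underflow
    let mut skipped = Accumulate { shard: 1, total: 4, delta: 10, ..Default::default() };
    skipped.apply(&mut st);
    skipped.undo(&mut st);
    assert_eq!(st.end_total, 0);

    // normal apply / undo round-trip restores the snapshot
    let mut ok = Accumulate { shard: 0, total: 4, delta: 10, ..Default::default() };
    ok.apply(&mut st);
    assert_eq!(st.end_total, 10);
    ok.undo(&mut st);
    assert_eq!(st.end_total, 0);
    assert_eq!(st.progress, None);
}
```

The key property is that `undo` is a no-op whenever `apply` took any of its early-return branches, so a rollback after a skipped apply can never underflow the counters.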
The divides-256 invariant on account_shards is enforced only via
debug_assert! in shard_key_range(), which is stripped in release
builds. An invalid TOML value (0, 3, 7, 100, ...) would deserialize
cleanly and silently corrupt key-range coverage.

Call validate_total_shards() at the top of CardanoLogic::initialize
and surface failures as ChainError::InvalidConfig so misconfiguration
fails the startup with a clear message.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
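A minimal sketch of the fail-fast check this commit describes, assuming the invariant is "at least 1 and divides 256"; the error type and message here are illustrative, not the real `ChainError::InvalidConfig`:

```rust
// Runtime validation instead of a debug_assert! that release builds strip.
#[derive(Debug, PartialEq)]
struct InvalidConfig(String);

fn validate_total_shards(total: u32) -> Result<(), InvalidConfig> {
    if total == 0 || 256 % total != 0 {
        return Err(InvalidConfig(format!(
            "account_shards must be >= 1 and divide 256, got {total}"
        )));
    }
    Ok(())
}

fn main() {
    // valid values: 1, 2, 4, 8, 16, 32, 64, 128, 256
    assert!(validate_total_shards(8).is_ok());
    // 0, 3, 7, 100 deserialize cleanly from TOML but would silently
    // corrupt key-range coverage if only a debug_assert! guarded them
    assert!(validate_total_shards(0).is_err());
    assert!(validate_total_shards(3).is_err());
    assert!(validate_total_shards(100).is_err());
}
```

Calling this once at startup turns a silent key-range corruption into a clear configuration error.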
After this branch's restructure, per-account reward application,
unspendable routing, and EWRAP-time registration filtering live in
AShard; only MIRs, refunds, and boundary close remain in Ewrap.
Update the Classify, Work Units, Source Files, and Instrumentation
tables so the bisection workflow points at the right work unit
(and module path) for each failure shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
crates/cardano/src/lib.rs (1)

250-258: Minor: deduplicate effective-shard-count computation.

The body of read_effective_account_shards (lines 253-258) is identical to the inline block at lines 345-348 in initialize. Since Self isn't yet constructed there, the method form can't be reused, but extracting a free helper (or Self::compute_effective_account_shards<D>(state, &config)) would keep both call sites in sync if the fallback rule ever changes.

♻️ Suggested refactor
```diff
-    /// Compute the effective `account_shards` value: stored
-    /// `ashard_progress.total` if a boundary is in flight, otherwise the
-    /// configured value.
-    fn read_effective_account_shards<D: Domain>(&self, state: &D::State) -> u32 {
-        load_epoch::<D>(state)
-            .ok()
-            .and_then(|e| e.ashard_progress.as_ref().map(|p| p.total))
-            .unwrap_or_else(|| self.config.account_shards())
-    }
+    /// Compute the effective `account_shards` value: stored
+    /// `ashard_progress.total` if a boundary is in flight, otherwise the
+    /// configured value.
+    fn compute_effective_account_shards<D: Domain>(state: &D::State, config: &CardanoConfig) -> u32 {
+        load_epoch::<D>(state)
+            .ok()
+            .and_then(|e| e.ashard_progress.as_ref().map(|p| p.total))
+            .unwrap_or_else(|| config.account_shards())
+    }
+
+    fn read_effective_account_shards<D: Domain>(&self, state: &D::State) -> u32 {
+        Self::compute_effective_account_shards::<D>(state, &self.config)
+    }
```

Then at line 345-348:

```diff
-        let effective_account_shards = load_epoch::<D>(state)
-            .ok()
-            .and_then(|e| e.ashard_progress.as_ref().map(|p| p.total))
-            .unwrap_or_else(|| config.account_shards());
+        let effective_account_shards =
+            Self::compute_effective_account_shards::<D>(state, &config);
```

Also applies to: 343-348

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/cardano/src/lib.rs` around lines 250 - 258, The effective-shard-count
computation is duplicated between the method read_effective_account_shards and
the inline block in initialize; extract the logic into a single helper (either a
free function like compute_effective_account_shards::<D>(state, &config) or an
associated fn Self::compute_effective_account_shards::<D>(state, &config)) that
returns the u32 by loading the epoch and falling back to
config.account_shards(), then replace both the body of
read_effective_account_shards and the inline compute in initialize to call that
helper so both call sites stay in sync if the fallback changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@crates/cardano/src/lib.rs`:
- Around line 250-258: The effective-shard-count computation is duplicated
between the method read_effective_account_shards and the inline block in
initialize; extract the logic into a single helper (either a free function like
compute_effective_account_shards::<D>(state, &config) or an associated fn
Self::compute_effective_account_shards::<D>(state, &config)) that returns the
u32 by loading the epoch and falling back to config.account_shards(), then
replace both the body of read_effective_account_shards and the inline compute in
initialize to call that helper so both call sites stay in sync if the fallback
changes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3543b5af-c9cb-43d0-942e-a2c733a526c3

📥 Commits

Reviewing files that changed from the base of the PR and between 8174d2a and aa835bb.

📒 Files selected for processing (4)
  • crates/cardano/src/lib.rs
  • crates/cardano/src/model/epochs.rs
  • crates/core/src/lib.rs
  • skills/debug-epoch-mismatch/SKILL.md
✅ Files skipped from review due to trivial changes (1)
  • crates/core/src/lib.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • skills/debug-epoch-mismatch/SKILL.md

scarmuega and others added 6 commits April 26, 2026 19:52
Wraps every phase (load/compute/commit_*) of `execute_work_unit` in
sync and import with an `RssProbe` that emits an `info!` event with
`phase`, `rss_before_mb`, `rss_after_mb`, `rss_delta_mb`. Attaches via
the surrounding `#[instrument(name = "work_unit")]` span so the work
unit's name is included automatically. Helps localize boundary memory
spikes (RUPD/AShard/Ewrap/ESTART) without per-crate boilerplate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
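A hedged sketch of what such a probe could look like on Linux, reading `VmRSS` from `/proc/self/status`; the crate's actual `RssProbe` may be implemented differently:

```rust
// Parse the VmRSS line out of a /proc/<pid>/status dump, in kilobytes.
fn parse_vm_rss_kb(status: &str) -> Option<u64> {
    status
        .lines()
        .find(|l| l.starts_with("VmRSS:"))
        .and_then(|l| l.split_whitespace().nth(1))
        .and_then(|v| v.parse().ok())
}

// Current resident-set size in MiB; None off-Linux or on read failure.
fn current_rss_mb() -> Option<u64> {
    let status = std::fs::read_to_string("/proc/self/status").ok()?;
    parse_vm_rss_kb(&status).map(|kb| kb / 1024)
}

fn main() {
    let rss_before_mb = current_rss_mb();
    // ... run a phase (load / compute / commit_*) here ...
    let rss_after_mb = current_rss_mb();
    if let (Some(b), Some(a)) = (rss_before_mb, rss_after_mb) {
        // the real probe emits this through tracing's info! inside the
        // surrounding work_unit span rather than printing
        println!(
            "phase=load rss_before_mb={b} rss_after_mb={a} rss_delta_mb={}",
            a as i64 - b as i64
        );
    }
    assert_eq!(parse_vm_rss_kb("VmRSS:\t  2048 kB\n"), Some(2048));
}
```

Sampling before and after each phase is what lets a boundary memory spike be attributed to a specific work-unit phase rather than the process as a whole.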
Account/PendingReward keys are 32-byte CBOR encodings of `StakeCredential`
whose first 4 bytes (`0x82 <variant> 0x58 0x1c`) carry no entropy across
credentials. Sharding by `key[0]` therefore funnelled every credential
into a single bucket regardless of `account_shards`, so only one shard
ever did real work.

`shard_key_range` becomes `shard_key_ranges` and returns one range per
`StakeCredential` variant, sliced on `key[4]` (first byte of the actual
hash). Each AShard now scans two contiguous ranges that together cover
~1/N of the credential keyspace, giving even per-shard work without any
data migration. The CBOR layout invariant is asserted in a unit test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
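The per-variant slicing can be illustrated as follows; the variant tags and the exact bound encoding are assumptions for the sketch, not the crate's real `shard_key_ranges`:

```rust
const PREFIX_SPACE: u32 = 256; // possible values of key[4], the first hash byte

/// Half-open range of key[4] values owned by one shard.
fn prefix_range(shard_index: u32, total_shards: u32) -> (u32, u32) {
    assert!(total_shards >= 1 && PREFIX_SPACE % total_shards == 0);
    assert!(shard_index < total_shards);
    let width = PREFIX_SPACE / total_shards;
    (shard_index * width, (shard_index + 1) * width)
}

/// One (start, end) key-prefix pair per StakeCredential variant, slicing
/// on key[4] because key[0..4] (0x82 <variant> 0x58 0x1c) carries no
/// entropy. Variant tags 0x00 / 0x01 are illustrative.
fn shard_key_ranges(shard_index: u32, total_shards: u32) -> Vec<(Vec<u8>, Vec<u8>)> {
    let (start, end) = prefix_range(shard_index, total_shards);
    [0x00u8, 0x01]
        .iter()
        .map(|&variant| {
            let lo = vec![0x82, variant, 0x58, 0x1c, start as u8];
            let hi = if end == PREFIX_SPACE {
                vec![0x82, variant + 1] // exclusive bound rolls past the variant
            } else {
                vec![0x82, variant, 0x58, 0x1c, end as u8]
            };
            (lo, hi)
        })
        .collect()
}

fn main() {
    // sharding on key[0] would funnel everything into one bucket: key[0]
    // is always 0x82. Slicing on key[4] gives each shard two contiguous
    // ranges (one per credential variant) covering ~1/N of the keyspace.
    let ranges = shard_key_ranges(1, 4);
    assert_eq!(ranges.len(), 2);
    assert_eq!(ranges[0].0, vec![0x82, 0x00, 0x58, 0x1c, 64]);
    assert_eq!(ranges[0].1, vec![0x82, 0x00, 0x58, 0x1c, 128]);
    assert_eq!(ranges[1].0, vec![0x82, 0x01, 0x58, 0x1c, 64]);
}
```

Because the split happens on the first byte with real entropy, no data migration is needed: the key encoding is unchanged and only the scan bounds move.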
Hoist the credential-keyed shard helpers from `ashard/shard.rs` to a
top-level `crate::shard` module and apply the same per-shard pattern
to ESTART so per-account snapshot rotations stream through bounded-
memory shards instead of accumulating millions of `AccountTransition`
deltas in one Vec.

New work unit `EStartShardWorkUnit` lives in a sibling `estart_shard/`
module that adds shard-aware load/commit methods to `WorkContext`,
mirroring the `ashard/` ↔ `ewrap/` relationship. The existing
`EstartWorkUnit` is repurposed as the finalize half: pool / drep /
proposal transitions, the closing `EpochTransition` (epoch advance +
new pots + era migration), archive logs, and the cursor advance
(which only ever moves here — never per shard).

Boundary pipeline is now:
  Blocks → AShard×N → Ewrap → EStartShard×N → Estart(finalize) → Blocks

Same `CardanoConfig::account_shards` drives both halves. Progress is
tracked separately on `EpochState.estart_shard_progress` (parallel to
`ashard_progress`); only one of the two is ever populated at once.
`EStartShardAccumulate` mirrors `EpochEndAccumulate`'s idempotency,
ordering, and total-mismatch guards. `EpochTransition` snapshots and
clears both progress fields. The crash-recovery warning at startup
covers both halves; full mid-EStart-shard resume remains the same TODO
posture as the existing AShard pipeline (`AccountTransition` is not
natively idempotent on re-apply).

`AShardProgress` is renamed to `ShardProgress` since it now serves
two phases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Update sequence diagram to show the open-half pipeline
(EStartShard ×N → Estart) alongside the close-half (AShard ×N → Ewrap),
add a new section for `EStartShardWorkUnit`, and trim Estart's section
to its finalize-only delta set (drops `AccountTransition`, calls out
that `EpochTransition` now clears both `ashard_progress` and
`estart_shard_progress`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reshape the epoch-boundary pipeline so each half is one self-contained
work unit instead of four:

- `WorkUnit` trait gains `total_shards()`, `initialize()`, `finalize()`,
  and a `shard_index: u32` parameter on the per-phase methods. The core
  executor (`crates/core/src/sync.rs::run_lifecycle`) loops the
  load/compute/commit cycle once per shard between `initialize` and
  `finalize`. Default `total_shards = 1` keeps non-sharded work units
  untouched.

- `EwrapWorkUnit` (close half, formerly `AShardWorkUnit`) and
  `EstartWorkUnit` (open half, formerly `EStartShardWorkUnit`) absorb
  their respective global passes via `finalize()`. The pre-existing
  `Ewrap`/`Estart` work units are removed; `CardanoWorkUnit` shrinks
  from 7 variants to 5 and `WorkBuffer` from 12 states to 10. Stop-epoch
  logic moves into `EstartBoundary`'s transition (cursor still advances
  only there).

- All code that runs as part of a single work unit lives under one
  module: `ashard/` + `ewrap/` collapsed into `crates/cardano/src/ewrap/`,
  `estart_shard/` + `estart/` collapsed into `crates/cardano/src/estart/`.
  AVVM reclamation hoisted out of the per-shard `load` into `initialize`.

- Persisted state field names (`ashard_progress`, `estart_shard_progress`)
  and the Serde-tagged `EStartShardAccumulate` delta are preserved to
  avoid an on-disk migration.

The `WorkBuffer` no longer enumerates shards; `CardanoLogic` drops the
`effective_account_shards` cache. Crash recovery still relies on the
existing `committed` guards in `EpochEndAccumulate` /
`EStartShardAccumulate` — same correctness posture as before.

Test harness gains a post-finalize callback (fires once with
`shard_index == total_shards`) so per-boundary introspection (e.g.
`epoch_pots`) sees the global teardown state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
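The sharded lifecycle this commit describes can be sketched as follows; the trait surface is simplified (the real `WorkUnit` has more commit phases, error handling, and a domain parameter):

```rust
trait WorkUnit {
    fn initialize(&mut self) {}
    fn total_shards(&self) -> u32 {
        1 // default keeps non-sharded work units untouched
    }
    fn load(&mut self, shard_index: u32);
    fn compute(&mut self, shard_index: u32);
    fn commit_state(&mut self, shard_index: u32);
    fn finalize(&mut self) {}
}

fn run_lifecycle(unit: &mut dyn WorkUnit) {
    unit.initialize(); // shard count may be derived from persisted state here
    for shard in 0..unit.total_shards() {
        unit.load(shard);
        unit.compute(shard);
        unit.commit_state(shard); // real executor also runs commit_wal /
                                  // commit_archive / commit_indexes per shard
    }
    unit.finalize(); // the single global pass
}

// A recorder implementation to show the call ordering.
struct Recorder {
    total: u32,
    calls: Vec<String>,
}

impl WorkUnit for Recorder {
    fn initialize(&mut self) { self.calls.push("init".into()); }
    fn total_shards(&self) -> u32 { self.total }
    fn load(&mut self, i: u32) { self.calls.push(format!("load{i}")); }
    fn compute(&mut self, i: u32) { self.calls.push(format!("compute{i}")); }
    fn commit_state(&mut self, i: u32) { self.calls.push(format!("commit{i}")); }
    fn finalize(&mut self) { self.calls.push("finalize".into()); }
}

fn main() {
    let mut u = Recorder { total: 2, calls: vec![] };
    run_lifecycle(&mut u);
    assert_eq!(
        u.calls,
        vec!["init", "load0", "compute0", "commit0",
             "load1", "compute1", "commit1", "finalize"]
    );
}
```

Folding the per-shard loop into the executor is what lets the former AShard/EStartShard enumeration in `WorkBuffer` disappear: sharding becomes an internal property of a single work unit rather than a sequence of buffer states.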
…dary nomenclature

Rename the per-shard progress deltas and the persisted EpochState
progress fields to match the post-merge work-unit identities (Ewrap /
Estart):

- `EpochEndAccumulate` → `EWrapProgress` (carries shard reward
  accumulators + cursor)
- `EStartShardAccumulate` → `EStartProgress` (cursor only — per-account
  Estart work lands directly on AccountState)
- `EpochState.ashard_progress` → `EpochState.ewrap_progress`
- `EpochState.estart_shard_progress` → `EpochState.estart_progress`
- `EWrapProgress.prev_ashard_progress` → `prev_ewrap_progress`
- `EStartProgress.prev_estart_shard_progress` → `prev_estart_progress`

CBOR compatibility preserved: the minicbor `#[n(15)]` / `#[n(16)]`
positional indices on the EpochState fields are unchanged, so existing
on-disk state deserializes unmodified. Field-name changes only affect
serde-tagged paths (ad-hoc JSON dumps, debug prints), not the durable
state.

The delta type rename does change the `CardanoDelta` enum's serde
encoding (variant tag is the type name), so any node mid-sync with a
populated WAL will need to wipe / re-bootstrap. Fresh nodes are
unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 8

🧹 Nitpick comments (1)
crates/cardano/src/ewrap/loading.rs (1)

159-165: Acknowledge the ordering constraint; consider tracking the TODO.

The comment clearly documents the rewards-before-drops ordering requirement and its rationale. The TODO to move retires to ESTART would eliminate this ordering hack.

Would you like me to open an issue to track this refactoring task (moving retires to ESTART after snapshot)?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/cardano/src/ewrap/loading.rs` around lines 159 - 165, Acknowledge the
rewards-before-drops ordering constraint and record the TODO by opening a
tracking issue: create an issue titled something like "Refactor: move retires to
ESTART to remove rewards-before-drops ordering hack" referencing the comment in
crates/cardano/src/ewrap/loading.rs, summarizing the current hack (rewards must
apply before drops because refunds clone live values pre-snapshot), and add
acceptance criteria that retires are moved to ESTART and the ordering
hack/comment removed; include links to the affected code and label it as a
refactor/tech-debt item for prioritization.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/cardano/src/estart/work_unit.rs`:
- Around line 91-120: The initialize method only restores estart_progress.total
but not the committed shard index, so on restart shards will be replayed from 0
and non-idempotent AccountTransition will double-rotate accounts; update
initialize (in work_unit.rs) to read estart_progress.committed (if present) and
set the work-unit's starting shard/next-shard field accordingly (fall back to 0
when absent), and also update the shard iteration in the core runner (the loop
in crates/core/src/sync.rs that currently starts at 0) to begin at that restored
committed index so earlier shards are skipped or treated as already applied
before building/applying account deltas; reference symbols: initialize,
estart_progress.committed, EStartProgress, and the shard iteration in
crates/core/src/sync.rs.

In `@crates/cardano/src/ewrap/commit.rs`:
- Around line 169-198: stream_and_apply_namespace::<D, EpochState> applies the
final EpochState into the writer but does not update
BoundaryWork.ending_state(), so archive_writer.write_log_typed is archiving a
stale self.ending_state(); capture the applied EpochState from the stream path
(e.g. return or output value of stream_and_apply_namespace for EpochState) or
reload the EpochState from the writer/namespace after streaming, then either
assign that value into self.ending_state() or pass that captured EpochState
directly to archive_writer.write_log_typed instead of using self.ending_state()
so the archived record matches the live, post-EpochWrapUp state.

In `@crates/cardano/src/shard.rs`:
- Around line 66-74: The function shard_key_ranges currently uses debug_assert!
for validate_total_shards and shard_index checks which are removed in release
builds, causing division by zero or incorrect partitioning (PREFIX_SPACE /
total_shards) at runtime; change the function to return a Result or use
assert!/explicit validation so invalid total_shards or shard_index produce a
clear error. Specifically, replace the debug_assert! calls in shard_key_ranges
with a runtime check that calls validate_total_shards(total_shards) and returns
Err(...) on failure (or panics with assert! including the failing value), and
also validate shard_index < total_shards before calling variant_range; ensure
callers (e.g., code that reads ShardProgress.total) handle the Result if you
choose the Result approach.

In `@crates/cardano/work_units.md`:
- Around line 5-16: The fenced block containing the sequence diagram starting
with "Estart  →  Roll …  →  Rupd  →  Roll …  →  Ewrap" should include a language
tag to satisfy markdownlint MD040; update the opening fence from ``` to ```text
so the block is treated as plain text (leave the diagram content unchanged).
- Around line 141-143: Update the documentation to reference the new loader
name: replace the obsolete mention of BoundaryWork::load_ewrap with
BoundaryWork::load_finalize; locate the ewrap finalize docs that currently
describe "Builds a fresh `BoundaryWork` via `BoundaryWork::load_ewrap`" and
change that text to point to `BoundaryWork::load_finalize`, ensuring any
surrounding description or links reflect the new API name used in the
implementation (e.g., the loader invoked in the ewrap work unit).

In `@crates/core/src/work_unit.rs`:
- Around line 109-118: total_shards() must never return 0: change its return
type from u32 to core::num::NonZeroU32 (or alternatively enforce a runtime check
in the executor before running phases) so that a work unit cannot skip all
per-shard phases; update implementations that derive shard count in initialize()
to store and return a NonZeroU32 cached value, and update any call sites
(executor logic that iterates
load/compute/commit_wal/commit_state/commit_archive/commit_indexes) to accept
and unwrap the NonZeroU32 safely; if you prefer runtime validation instead, add
an explicit check in the executor (before running load/compute/commit_* but
after initialize()) that rejects total_shards() == 0 with a clear error.

In `@tests/bootstrap.rs`:
- Around line 40-49: The test helper currently calls
WorkUnit::<ToyDomain>::finalize(...) even though commit_archive() and
commit_indexes() are intentionally skipped, so it no longer models the "crash
after state commit" boundary; remove the finalize() call (or guard it behind a
flag) so the helper stops after WorkUnit::<ToyDomain>::commit_state(...),
leaving commit_archive() and commit_indexes() unexecuted to accurately simulate
the crash-after-state scenario and allow recovery paths to be tested.

In `@tests/memory.rs`:
- Around line 141-163: The test currently only samples Region::change() before
iterator use, so update it to also sample after consuming the iterator: call let
stats_before = reg.change() (or reuse heap_delta), consume the iterator with let
count = iter.count(), then call let stats_after = reg.change() and compute let
heap_delta_consumption = stats_after.bytes_allocated -
stats_before.bytes_allocated; finally assert that heap_delta_consumption is
below the same threshold (or a separate reasonable threshold) to ensure
iter_entities + iteration is heap-bounded. Reference the existing iter_entities,
iter.count(), Region::change(), and heap_delta/threshold variables when adding
the second assertion.

---

Nitpick comments:
In `@crates/cardano/src/ewrap/loading.rs`:
- Around line 159-165: Acknowledge the rewards-before-drops ordering constraint
and record the TODO by opening a tracking issue: create an issue titled
something like "Refactor: move retires to ESTART to remove rewards-before-drops
ordering hack" referencing the comment in crates/cardano/src/ewrap/loading.rs,
summarizing the current hack (rewards must apply before drops because refunds
clone live values pre-snapshot), and add acceptance criteria that retires are
moved to ESTART and the ordering hack/comment removed; include links to the
affected code and label it as a refactor/tech-debt item for prioritization.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ef9f2a57-98e9-488b-a6d0-c5240e79109d

📥 Commits

Reviewing files that changed from the base of the PR and between aa835bb and 62031ea.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (32)
  • crates/cardano/src/estart/commit.rs
  • crates/cardano/src/estart/loading.rs
  • crates/cardano/src/estart/mod.rs
  • crates/cardano/src/estart/reset.rs
  • crates/cardano/src/estart/work_unit.rs
  • crates/cardano/src/ewrap/commit.rs
  • crates/cardano/src/ewrap/loading.rs
  • crates/cardano/src/ewrap/mod.rs
  • crates/cardano/src/ewrap/refunds.rs
  • crates/cardano/src/ewrap/rewards.rs
  • crates/cardano/src/ewrap/work_unit.rs
  • crates/cardano/src/ewrap/wrapup.rs
  • crates/cardano/src/genesis/mod.rs
  • crates/cardano/src/genesis/work_unit.rs
  • crates/cardano/src/lib.rs
  • crates/cardano/src/model/epochs.rs
  • crates/cardano/src/model/mod.rs
  • crates/cardano/src/model/pending.rs
  • crates/cardano/src/roll/work_unit.rs
  • crates/cardano/src/rupd/work_unit.rs
  • crates/cardano/src/shard.rs
  • crates/cardano/src/work.rs
  • crates/cardano/work_units.md
  • crates/core/Cargo.toml
  • crates/core/src/config.rs
  • crates/core/src/import.rs
  • crates/core/src/sync.rs
  • crates/core/src/work_unit.rs
  • crates/testing/src/harness/cardano.rs
  • tests/bootstrap.rs
  • tests/epoch_pots/main.rs
  • tests/memory.rs
✅ Files skipped from review due to trivial changes (5)
  • crates/cardano/src/estart/mod.rs
  • crates/cardano/src/ewrap/refunds.rs
  • crates/core/Cargo.toml
  • crates/cardano/src/ewrap/rewards.rs
  • crates/cardano/src/model/pending.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • crates/core/src/config.rs

Comment thread crates/cardano/src/estart/work_unit.rs
Comment thread crates/cardano/src/ewrap/commit.rs
Comment thread crates/cardano/src/shard.rs
Comment thread crates/cardano/work_units.md Outdated
Comment thread crates/cardano/work_units.md Outdated
Comment on lines +109 to +118
```rust
/// Number of shards this work unit splits into.
///
/// The executor calls each per-shard phase `total_shards()` times.
/// Defaults to `1` for non-sharded work units. The returned value
/// must be valid after `initialize()` has run — implementations that
/// derive the count from persisted state should compute it inside
/// `initialize()` and cache it on `self`.
fn total_shards(&self) -> u32 {
    1
}
```


⚠️ Potential issue | 🟠 Major

Make zero shards impossible.

total_shards() == 0 now means no load/compute/commit_* phase runs, yet the work unit can still proceed to finalize() and post-work hooks. One bad implementation can therefore "complete" without processing any shard. Please encode this as NonZeroU32 or reject zero before execution.

Based on learnings: Work units function as mini-ETL jobs that extract data from storage, transform it using chain-specific logic, and load results into appropriate stores following the sequence: load() → compute() → commit_wal() → commit_state() → commit_archive() → commit_indexes().
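The suggested `NonZeroU32` variant could look roughly like this; it is a sketch of the reviewer's proposal, not the crate's current trait:

```rust
use std::num::NonZeroU32;

trait WorkUnit {
    /// Defaults to 1; an implementation deriving its count in
    /// initialize() would cache a NonZeroU32 and return it here,
    /// so zero shards is unrepresentable by construction.
    fn total_shards(&self) -> NonZeroU32 {
        NonZeroU32::new(1).unwrap()
    }
}

struct Sharded {
    shards: NonZeroU32,
}

impl WorkUnit for Sharded {
    fn total_shards(&self) -> NonZeroU32 {
        self.shards
    }
}

fn main() {
    let unit = Sharded { shards: NonZeroU32::new(4).expect("non-zero") };
    // .get() at the loop boundary is guaranteed >= 1, so the executor
    // can never skip every per-shard phase and jump to finalize()
    let n = unit.total_shards().get();
    assert!(n >= 1);
    assert_eq!((0..n).count(), 4);
    // a zero count simply cannot be constructed
    assert!(NonZeroU32::new(0).is_none());
}
```

The alternative the reviewer names, a runtime check in the executor after `initialize()`, trades this compile-time guarantee for an unchanged trait signature.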

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/core/src/work_unit.rs` around lines 109 - 118, total_shards() must
never return 0: change its return type from u32 to core::num::NonZeroU32 (or
alternatively enforce a runtime check in the executor before running phases) so
that a work unit cannot skip all per-shard phases; update implementations that
derive shard count in initialize() to store and return a NonZeroU32 cached
value, and update any call sites (executor logic that iterates
load/compute/commit_wal/commit_state/commit_archive/commit_indexes) to accept
and unwrap the NonZeroU32 safely; if you prefer runtime validation instead, add
an explicit check in the executor (before running load/compute/commit_* but
after initialize()) that rejects total_shards() == 0 with a clear error.

Comment thread tests/bootstrap.rs Outdated
Comment thread tests/memory.rs Outdated
scarmuega and others added 3 commits April 28, 2026 09:53
RUPD was the boundary path's worst memory spike: a single load built
the full per-account `accounts_by_pool` + `registered_accounts` plus
an O(N) `RewardMap`. Mirror the Estart/Ewrap pattern — hoist
pool-bounded globals into `initialize()` (pots, incentives, pparams,
pool snapshots, pool_stake totals) and shard the per-credential leg
across the same `account_shards` partitions: each shard streams
`AccountState` over its two key ranges only, builds a shard-scoped
delegator + registered set, runs `define_rewards` over every pool
but emits only in-range credentials, and writes the in-range
`PendingRewardState` entities.

Leader-reward emission gates on a new default-true
`RewardsContext::should_include`, so the shard whose range contains
the operator credential is the sole emitter for that pool's leader
reward. Delegator emissions are filtered naturally via
`pool_delegators` returning only in-range creds, with a defensive
`should_include` check at the merge site.

Per-shard progress is tracked by a new `RupdProgress` delta on
`EpochState.rupd_progress`, with the same idempotency + ordering +
total-mismatch guards as `EWrapProgress` / `EStartProgress`.
`EpochTransition` rollback now also captures and clears
`rupd_progress`.  `EpochState.incentives` and per-pool `StakeLog`
archive entries move to `finalize()` (one-shot, after every shard
has committed) so concurrent shard commits can't race; the per-pool
reward + delegator-count contributions accumulate on the work unit
as O(pools) state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
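The single-emitter rule for leader rewards can be illustrated with a toy range check standing in for `RewardsContext::should_include`; the arithmetic mirrors the byte-range sharding, but the function names here are illustrative:

```rust
/// Which shard owns a credential, keyed on the first byte of its hash.
fn shard_of(first_hash_byte: u8, total_shards: u32) -> u32 {
    let width = 256 / total_shards; // total_shards divides 256 by invariant
    first_hash_byte as u32 / width
}

/// Toy stand-in for should_include: is this credential in my range?
fn in_range(first_hash_byte: u8, shard_index: u32, total_shards: u32) -> bool {
    shard_of(first_hash_byte, total_shards) == shard_index
}

fn main() {
    let total = 4;
    let operator_byte = 0x9a; // first byte of the pool operator's credential hash
    // every shard runs define_rewards over every pool, but exactly one
    // shard's range contains the operator credential, so exactly one
    // shard emits that pool's leader reward
    let emitters: Vec<u32> = (0..total)
        .filter(|&s| in_range(operator_byte, s, total))
        .collect();
    assert_eq!(emitters.len(), 1);
    assert_eq!(emitters, vec![2]); // 0x9a = 154; 154 / 64 = 2
}
```

Delegator emissions need no such gate because each shard's scan already yields only in-range credentials; the defensive check at the merge site is belt-and-braces.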
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…constant

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

♻️ Duplicate comments (3)
crates/core/src/work_unit.rs (1)

60-69: ⚠️ Potential issue | 🟠 Major

Reject 0 shards in the trait contract.

A u32 return still lets an implementation report 0 after initialize(), which means the executor can skip every load / compute / commit_* phase and jump straight to finalize(). Make this NonZeroU32 or add an explicit runtime check in the runner.

Based on learnings: Work units function as mini-ETL jobs that extract data from storage, transform it using chain-specific logic, and load results into appropriate stores following the sequence: load() → compute() → commit_wal() → commit_state() → commit_archive() → commit_indexes().

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/core/src/work_unit.rs` around lines 60 - 69, The trait currently
allows total_shards() to return 0 which skips all shard phases; fix by making
the contract non-zero: change the return type of total_shards() to
std::num::NonZeroU32 (update implementations to cache a NonZeroU32 in
initialize()), or if you prefer not to change the trait signature, add an
explicit runtime validation in the runner/executor (after calling initialize()
and reading total_shards()) that checks shards != 0 and returns an Err or panics
with a clear message; refer to the total_shards() method and initialize() call
sites in the executor/runner that loop over shards to add the check.
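The `NonZeroU32` option suggested above could look roughly like this trimmed-down sketch (hypothetical trait and struct names, not the real `WorkUnit` trait): a zero shard count becomes unrepresentable at the type level, so the executor loop is guaranteed a non-empty shard range.

```rust
use std::num::NonZeroU32;

// Hypothetical, minimal version of the sharding contract.
trait ShardedWorkUnit {
    fn total_shards(&self) -> NonZeroU32;
    fn start_shard(&self) -> u32 {
        0 // fresh boundary: start from the first shard
    }
}

struct Demo {
    total: NonZeroU32, // cached in initialize() in the real design
}

impl ShardedWorkUnit for Demo {
    fn total_shards(&self) -> NonZeroU32 {
        self.total
    }
}

fn main() {
    let unit = Demo { total: NonZeroU32::new(8).expect("non-zero") };
    // Executor loop: start_shard..total_shards can never be 0..0.
    let shards: Vec<u32> = (unit.start_shard()..unit.total_shards().get()).collect();
    assert_eq!(shards.len(), 8);
}
```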
crates/cardano/src/estart/work_unit.rs (1)

93-106: ⚠️ Potential issue | 🔴 Critical

Resume from estart_progress.committed, and don't mask epoch read failures.

initialize() still restores only estart_progress.total. After a crash, the executor will rerun shards 0..k-1 unless it also resumes from committed, and this file already documents AccountTransition as non-idempotent. Falling back to ACCOUNT_SHARDS on load_epoch errors makes that restart path even less safe by silently repartitioning the boundary.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/cardano/src/estart/work_unit.rs` around lines 93 - 106, The
initialize() logic must resume from estart_progress.committed and must not
silently mask load_epoch failures; instead of unconditionally falling back to
ACCOUNT_SHARDS on Err(_), propagate the error and when load_epoch succeeds, set
self.total_shards from epoch.estart_progress.as_ref().map(|p|
p.total).unwrap_or(ACCOUNT_SHARDS) and also set the worker's resume point from
epoch.estart_progress.as_ref().and_then(|p| p.committed) (e.g., assign to
whatever field tracks the next shard to run), so shards restart from the
committed position; remove the Err(_) => ACCOUNT_SHARDS branch and return the
DomainError on read failure.
crates/cardano/src/shard.rs (1)

76-79: ⚠️ Potential issue | 🟠 Major

Validate shard counts in release builds too.

These guards disappear in release, but total_shards now comes from persisted ShardProgress.total as well as the constant. A bad value can still panic on PREFIX_SPACE / total_shards or produce broken partitions at runtime, so this needs a real assert! or Result path instead of debug_assert!.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/cardano/src/shard.rs` around lines 76 - 79, The debug-only checks in
shard_key_ranges are unsafe; replace the debug_assert! calls with real runtime
validation: call validate_total_shards(total_shards) and return/propagate an
error on failure (or use assert! if you prefer panicking) and explicitly check
shard_index < total_shards, returning Err for invalid inputs; ensure callers
that read persisted ShardProgress.total handle the Result. Update
shard_key_ranges (and its callers) to use the new Result-returning signature or
use assert! so that division by PREFIX_SPACE / total_shards and partitioning
logic cannot run with invalid total_shards.
🧹 Nitpick comments (2)
crates/cardano/src/model/epochs.rs (1)

780-783: The expect() on entity.end remains fragile.

Per the past review comment, this expect() relies on Genesis seeding end = Some(EndStats::default()) and EpochTransition re-seeding it on every subsequent transition. While the comment was marked as addressed, the code still uses expect() rather than the suggested get_or_insert_with(EndStats::default) approach, which would be more defensive.

This is low-risk since the invariant is maintained by the current code paths, but consider the defensive approach for robustness against future regressions.

🛡️ Optional defensive change
-        let end = entity
-            .end
-            .as_mut()
-            .expect("ESTART seeded EpochState.end before shards run");
+        let end = entity.end.get_or_insert_with(EndStats::default);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/cardano/src/model/epochs.rs` around lines 780 - 783, Replace the
fragile expect() on entity.end with a defensive get_or_insert_with to ensure
EndStats is initialized if missing; specifically, change the code path in the
epochs handling where entity.end is accessed (currently using
.end.as_mut().expect("ESTART seeded EpochState.end before shards run")) to call
.end.get_or_insert_with(EndStats::default) or equivalent so the EndStats default
is guaranteed without panicking, leaving the rest of the logic in the
EpochTransition/epochs handling unchanged.
crates/cardano/src/lib.rs (1)

349-448: Consider adding crash-recovery check for rupd_progress.

The initialization checks ewrap_progress and estart_progress for crash-recovery scenarios, but rupd_progress is not checked. Since RUPD also uses the same sharded pattern with RupdProgress deltas, operators would benefit from similar logging if a crash occurred mid-RUPD.

📝 Suggested addition after line 447
// Same crash-recovery check for the RUPD phase.
if let Some(progress) = epoch.rupd_progress.as_ref() {
    let configured = crate::shard::ACCOUNT_SHARDS;
    if progress.total != configured {
        tracing::warn!(
            epoch = epoch.number,
            stored_total = progress.total,
            configured_total = configured,
            "in-flight RUPD uses {} shards but ACCOUNT_SHARDS = {}; \
             the in-flight RUPD will continue with {} (the persisted total)",
            progress.total,
            configured,
            progress.total,
        );
    }
    if progress.committed < progress.total {
        tracing::warn!(
            epoch = epoch.number,
            next_shard = progress.committed,
            total_shards = progress.total,
            "crash detected mid-RUPD: rupd_progress is set. \
             On the next RUPD trigger, dolos will resume the pipeline."
        );
    }
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/cardano/src/lib.rs` around lines 349 - 448, Add the same
crash-recovery logging for rupd_progress that exists for ewrap_progress and
estart_progress: after loading epoch (load_epoch::<D> and the epoch variable),
check epoch.rupd_progress.as_ref(), compare progress.total against
crate::shard::ACCOUNT_SHARDS and emit a tracing::warn with epoch, stored_total,
configured_total when they differ, and if progress.committed < progress.total
emit a tracing::warn including next_shard and total_shards warning that a crash
was detected mid-RUPD; mirror the style and fields used in the existing
ewrap/estart blocks (use progress, progress.total, progress.committed, and
epoch.number).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/cardano/src/ewrap/work_unit.rs`:
- Around line 71-84: The code in initialize uses load_epoch::<D>(domain.state())
and silently treats any Err as ACCOUNT_SHARDS, breaking the “persisted shard
count is authoritative” guarantee; change the Err(_) arm to propagate the error
instead of falling back. Replace the match so load_epoch errors are returned
(e.g., use let epoch = load_epoch::<D>(domain.state())?; or map_err into
DomainError and return Err) and then set self.total_shards from epoch. Keep the
existing Ok(epoch) branch logic (use epoch.ewrap_progress.as_ref().map(|p|
p.total).unwrap_or(ACCOUNT_SHARDS)) but do not convert load_epoch failures into
ACCOUNT_SHARDS.

In `@crates/cardano/src/genesis/mod.rs`:
- Around line 80-81: Fix the typo in the comment that currently reads "Ewrap
(which now runs before Ewrap)" in the seed `end` comment block in
genesis/mod.rs; update the text to reference the correct visitor (e.g.,
"AccountShard (which now runs before Ewrap)") so the comment correctly explains
that per-account shards run before the global Ewrap visitor, leaving the rest of
the sentence unchanged.

In `@crates/cardano/src/rupd/work_unit.rs`:
- Around line 380-391: Add an inline comment next to the hardcoded live_pledge:
0 in the StakeLog construction explaining why it's set to zero in this RUPD path
(e.g., "live_pledge is not computed during RUPD updates; computed in rewards
module" or note it's a known limitation), so future readers understand this is
intentional; update the comment adjacent to the StakeLog instantiation (where
live_pledge is assigned) and reference the rewards computation location or TODO
if it should be implemented later.

In `@crates/cardano/src/shard.rs`:
- Around line 46-49: The const assert using the modulo check should use the
const-friendly method: replace the predicate `PREFIX_SPACE % ACCOUNT_SHARDS ==
0` with `PREFIX_SPACE.is_multiple_of(ACCOUNT_SHARDS)` in the const assert that
references ACCOUNT_SHARDS and PREFIX_SPACE (the const block with the assert at
the top of shard.rs); update the assert to read that ACCOUNT_SHARDS >= 1 &&
PREFIX_SPACE.is_multiple_of(ACCOUNT_SHARDS) to silence the clippy
manual_is_multiple_of warning and keep semantics identical.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c85cdd15-e7d9-42c1-8e53-7e98f8da91d2

📥 Commits

Reviewing files that changed from the base of the PR and between 62031ea and a36ccf6.

📒 Files selected for processing (14)
  • crates/cardano/src/estart/work_unit.rs
  • crates/cardano/src/ewrap/work_unit.rs
  • crates/cardano/src/genesis/mod.rs
  • crates/cardano/src/lib.rs
  • crates/cardano/src/model/epochs.rs
  • crates/cardano/src/model/mod.rs
  • crates/cardano/src/rewards/mod.rs
  • crates/cardano/src/rupd/loading.rs
  • crates/cardano/src/rupd/mod.rs
  • crates/cardano/src/rupd/work_unit.rs
  • crates/cardano/src/shard.rs
  • crates/cardano/work_units.md
  • crates/core/src/sync.rs
  • crates/core/src/work_unit.rs

Comment thread crates/cardano/src/ewrap/work_unit.rs Outdated
Comment thread crates/cardano/src/genesis/mod.rs Outdated
Comment on lines +380 to +391
let log = StakeLog {
blocks_minted,
total_stake: pool_stake,
relative_size,
live_pledge: 0,
declared_pledge,
delegators_count: share.delegators_count,
total_rewards: share.total_rewards,
operator_share: share.operator_share,
fixed_cost,
margin_cost,
};

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if live_pledge is populated elsewhere or if this is expected
rg -n "live_pledge" crates/cardano/src/rupd/ -A 2 -B 2
rg -n "live_pledge" crates/cardano/src/model/logs.rs -A 2 -B 2

Repository: txpipe/dolos

Length of output: 574


🏁 Script executed:

# Search for all StakeLog creations to understand if this is the only place
rg -n "StakeLog\s*\{" --type rust -A 15

Repository: txpipe/dolos

Length of output: 3390


🏁 Script executed:

# Check for TODO/FIXME comments related to pledge or StakeLog
rg -n "(TODO|FIXME).*?(pledge|live|stake)" crates/cardano/src/ -i

Repository: txpipe/dolos

Length of output: 38


🏁 Script executed:

# Check if StakeLog is used elsewhere and how live_pledge is typically populated
rg -n "live_pledge" crates/cardano/src/ -B 3 -A 1 --type rust

Repository: txpipe/dolos

Length of output: 5777


Add a comment explaining why live_pledge is hardcoded to 0.

The StakeLog entry sets live_pledge: 0 in the RUPD context. While live_pledge is computed elsewhere in the codebase (in the rewards module), RUPD appears to intentionally omit this computation. Add an inline comment to clarify whether this is by design (e.g., "live pledge is not computed in RUPD updates") or if it's a known limitation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/cardano/src/rupd/work_unit.rs` around lines 380 - 391, Add an inline
comment next to the hardcoded live_pledge: 0 in the StakeLog construction
explaining why it's set to zero in this RUPD path (e.g., "live_pledge is not
computed during RUPD updates; computed in rewards module" or note it's a known
limitation), so future readers understand this is intentional; update the
comment adjacent to the StakeLog instantiation (where live_pledge is assigned)
and reference the rewards computation location or TODO if it should be
implemented later.

Comment thread crates/cardano/src/shard.rs
Bincode encodes CardanoDelta variants by positional index and structs by
positional fields, so this branch's three new variants (EWrapProgress,
EStartProgress, RupdProgress) inserted mid-enum and the appended `prev_*_progress`
fields on EpochWrapUp / EpochTransition would have made pre-upgrade WAL
rows undecodable.

Freezes CardanoDelta indices 0..=38 to match pre-PR `main`, restores the
legacy struct shapes verbatim under `#[deprecated]`, and introduces
EpochWrapUpV2 / EpochTransitionV2 carrying the new undo state. The three
sharded-progress variants plus the V2 variants are appended at the end
of the enum. New commit paths in estart/reset.rs and ewrap/wrapup.rs
emit V2; the legacy types remain solely for replay of older WAL rows
and carry TODO(wal-compat) cleanup notes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
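The failure mode this commit avoids can be shown with a toy positional decoder (this is an illustration, not bincode itself): a positional encoder stores only the variant's index, so an old WAL row written as "index 1" means whatever sits at position 1 in the enum at decode time.

```rust
#[derive(Debug, PartialEq)]
enum DeltaInserted { Roll, Progress, WrapUp } // new variant inserted mid-enum
#[derive(Debug, PartialEq)]
enum DeltaAppended { Roll, WrapUp, Progress } // new variant appended at the end

// A pre-upgrade writer encoded `WrapUp` as positional index 1.
const OLD_WRAP_UP_ROW: u32 = 1;

fn decode_inserted(ix: u32) -> DeltaInserted {
    match ix {
        0 => DeltaInserted::Roll,
        1 => DeltaInserted::Progress,
        _ => DeltaInserted::WrapUp,
    }
}

fn decode_appended(ix: u32) -> DeltaAppended {
    match ix {
        0 => DeltaAppended::Roll,
        1 => DeltaAppended::WrapUp,
        _ => DeltaAppended::Progress,
    }
}

fn main() {
    // Inserting mid-enum silently re-interprets the old row...
    assert_eq!(decode_inserted(OLD_WRAP_UP_ROW), DeltaInserted::Progress);
    // ...while appending keeps every pre-existing index stable.
    assert_eq!(decode_appended(OLD_WRAP_UP_ROW), DeltaAppended::WrapUp);
}
```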

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (2)
crates/cardano/src/model/epochs.rs (2)

1490-1497: Remove unused #[allow(deprecated)] attribute.

Static analysis correctly identifies this attribute as unused. Outer attributes on prop_compose! macro invocations don't propagate into the macro expansion. The test function at line 1660 that calls any_epoch_wrap_up() already has its own #[allow(deprecated)], which is where the suppression actually takes effect.

🧹 Suggested fix
-    #[allow(deprecated)]
     prop_compose! {
         fn any_epoch_wrap_up()(
             stats in any_end_stats(),
         ) -> EpochWrapUp {
             EpochWrapUp::new(stats)
         }
     }

As per coding guidelines: "Run cargo clippy --workspace --all-targets --all-features and resolve all clippy warnings before committing changes".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/cardano/src/model/epochs.rs` around lines 1490 - 1497, Remove the
unused outer attribute by deleting the #[allow(deprecated)] placed immediately
above the prop_compose! invocation that defines any_epoch_wrap_up; the
suppression does not propagate into the macro expansion and the test calling
any_epoch_wrap_up already has its own #[allow(deprecated)], so keep that test
attribute and remove the one before prop_compose! (symbols to locate:
prop_compose!, any_epoch_wrap_up, any_end_stats, EpochWrapUp::new).

1603-1613: Remove unused #[allow(deprecated)] attribute.

Same issue as line 1490 — this outer attribute on the macro invocation is ineffective. The test function at line 1685 already has the necessary #[allow(deprecated)].

🧹 Suggested fix
-    #[allow(deprecated)]
     prop_compose! {
         fn any_epoch_transition()(
             new_epoch in root::any_epoch(),
         ) -> EpochTransition {
             // new_pots is filled in by the test harness from the entity's initial_pots
             // so that `new_pots.max_supply() == entity.initial_pots.max_supply()` holds
             // (which `apply`'s debug_assert requires).
             EpochTransition::new(new_epoch, crate::pots::Pots::default(), None, None)
         }
     }

As per coding guidelines: "Run cargo clippy --workspace --all-targets --all-features and resolve all clippy warnings before committing changes".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/cardano/src/model/epochs.rs` around lines 1603 - 1613, Remove the
ineffective outer attribute #[allow(deprecated)] applied to the prop_compose!
macro invocation that defines any_epoch_transition; simply delete that attribute
so the macro block starts with prop_compose! { ... } and leave the rest
(new_epoch in root::any_epoch(), EpochTransition::new(...)) unchanged — the test
function already carries the needed #[allow(deprecated)] where necessary.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 31123132-1f23-4705-8ad1-c58958f3b406

📥 Commits

Reviewing files that changed from the base of the PR and between a36ccf6 and 3923753.

📒 Files selected for processing (4)
  • crates/cardano/src/estart/reset.rs
  • crates/cardano/src/ewrap/wrapup.rs
  • crates/cardano/src/model/epochs.rs
  • crates/cardano/src/model/mod.rs
🚧 Files skipped from review as they are similar to previous changes (3)
  • crates/cardano/src/estart/reset.rs
  • crates/cardano/src/model/mod.rs
  • crates/cardano/src/ewrap/wrapup.rs

scarmuega and others added 6 commits April 28, 2026 16:28
Previously, `EwrapWorkUnit` / `EstartWorkUnit` / `RupdWorkUnit::initialize`
read only `*_progress.total` from persisted state and the core lifecycle
loop iterated `0..total_shards` unconditionally, so a crash mid-boundary
replayed shards `0..k-1` on restart. That is unsafe: `AccountTransition`
deltas (Estart) are non-idempotent — replaying a committed shard would
double-rotate every account in it.

Adds `WorkUnit::start_shard()` (defaulting to `0`); `run_lifecycle` now
loops `start_shard..total_shards`. The three sharded work units cache
the committed cursor in `initialize` from `*_progress.committed`, and
their `start_shard()` returns it. `CardanoWorkUnit` delegates the new
method to its variants. The bootstrap test harness mirrors the real
runner.

Also propagates `load_epoch` failures via `?` instead of the previous
`Err(_) => ACCOUNT_SHARDS` swallow, so a state-read failure can no
longer silently repartition an in-flight boundary.

Addresses PR #978 review comments 3153861491 (restart cursor) and
3154574387 (propagate load_epoch failures).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
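The resume semantics described above reduce to choosing the shard range the lifecycle loop iterates (hypothetical names, not the actual `run_lifecycle` code): an in-flight boundary resumes at the persisted committed cursor with the persisted total, while a fresh boundary starts at shard 0 with the configured count.

```rust
#[derive(Clone, Copy)]
struct ShardProgress {
    committed: u32, // next shard to run after a restart
    total: u32,     // partition count frozen for this boundary
}

// Sketch of how the executor could derive its shard loop.
fn shards_to_run(progress: Option<ShardProgress>, configured_total: u32) -> std::ops::Range<u32> {
    match progress {
        // In-flight boundary: the persisted cursor and total are authoritative.
        Some(p) => p.committed..p.total,
        // Fresh boundary: start from shard 0 with the configured count.
        None => 0..configured_total,
    }
}

fn main() {
    // Crash after committing 5 of 8 shards -> resume at shard 5,
    // even if the configured constant has since changed to 16.
    let resumed = shards_to_run(Some(ShardProgress { committed: 5, total: 8 }), 16);
    assert_eq!(resumed, 5..8);
    assert_eq!(shards_to_run(None, 16), 0..16);
}
```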
…lize

`commit_finalize` was archiving `self.ending_state()` at the epoch-start
temporal key, but `stream_and_apply_namespace::<D, EpochState>` only
applied the boundary-closing deltas (PParamsUpdate, TreasuryWithdrawal,
EpochWrapUp) to the writer — `ending_state` itself was never refreshed.
The archived row therefore carried the pre-commit snapshot (stale
rolling/pparams, populated `ewrap_progress`).

Adds an EpochState-specific variant of the streaming helper that
returns the post-apply singleton; `commit_finalize` swaps the result
into `self.ending_state` before the archive write, so the archived
EpochState matches what's about to be committed to the live state
store.

Addresses PR #978 review comment 3153861497.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`shard_key_ranges` previously guarded `total_shards` and `shard_index`
with `debug_assert!`, which compiles to nothing in release. A `0` would
divide by zero in `variant_range`, and a non-divisor of 256 would
silently produce broken partitions. Since `total_shards` can come from
persisted `ShardProgress.total` (not just the compile-time
`ACCOUNT_SHARDS` constant), the invariants must hold at runtime in all
profiles.

Promotes the validation to unconditional `assert!` / `panic!` with
informative messages.

Addresses PR #978 review comment 3153861503.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
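A minimal sketch of the promoted validation (hypothetical function name and a 256-value first-byte prefix space, not the actual `shard_key_ranges` signature): the asserts are unconditional, so a bad persisted shard count can never divide by zero or produce overlapping partitions in release builds.

```rust
const PREFIX_SPACE: u32 = 256; // one partition axis per first key byte

/// Returns the inclusive first-byte prefix range covered by `shard_index`.
fn shard_key_range(total_shards: u32, shard_index: u32) -> (u8, u8) {
    // Unlike debug_assert!, these survive release builds.
    assert!(total_shards >= 1, "total_shards must be >= 1");
    assert!(
        PREFIX_SPACE % total_shards == 0,
        "total_shards must evenly divide the prefix space"
    );
    assert!(shard_index < total_shards, "shard_index out of range");

    let width = PREFIX_SPACE / total_shards;
    let start = shard_index * width;
    (start as u8, (start + width - 1) as u8)
}

fn main() {
    // 4 shards over 256 prefixes -> 64 contiguous prefixes each.
    assert_eq!(shard_key_range(4, 0), (0, 63));
    assert_eq!(shard_key_range(4, 3), (192, 255));
    // Degenerate single-shard case covers the whole space.
    assert_eq!(shard_key_range(1, 0), (0, 255));
}
```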
- genesis/mod.rs: fix copy/paste typo "Ewrap (which now runs before
  Ewrap)" → "per-shard Ewrap pass (which runs before the global Ewrap
  finalize)" (PR #978 comment 3154574432).
- work_units.md: add `text` language to fenced sequence diagram block
  for markdownlint MD040 (3153861508); fix stale loader name
  `BoundaryWork::load_ewrap` → `load_finalize` (3153861515).
- tests/memory.rs: sample heap allocation across full iteration as well
  as iterator construction, so a backend that buffers the shard on first
  `next()` no longer slips through (3153861530).
- model/epochs.rs: hoist `#[allow(deprecated)]` from per-item attributes
  on the prop_compose!/test items (where the macro hides the lint site)
  to the whole `prop_tests` module — silences the test-build deprecation
  warnings introduced by the V1/V2 split.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- shard.rs: const-context divisibility check now uses
  `u32::is_multiple_of` (stable-const since Rust 1.87) per
  `clippy::manual_is_multiple_of`.
- model/epochs.rs: replace `let mut x = T::default(); x.f = ...; x` with
  a struct-literal `..Default::default()` form per
  `clippy::field_reassign_with_default`; also demote a doc comment
  attached to a `prop_compose!` invocation to a regular `//` comment
  (rustdoc can't attach docs to macro invocations).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `crates/cardano/src/ashard/` module was merged into
`crates/cardano/src/ewrap/` (it became EWRAP's per-shard leg) and the
`EpochEndAccumulate` delta was renamed to `EWrapProgress`. The debug
guide's tables, narrative, file-path references, and instrumentation
hints still pointed at the old names and paths.

Updates the Step 3 classification table, Step 5 work-unit narrative +
source-file map, and Step 6 instrumentation table to:
- talk about EWRAP/ESTART each having a per-shard leg + finalize, not
  a separate ASHARD work unit;
- point at `ewrap/rewards.rs`, `crates/cardano/src/shard.rs`, and the
  per-unit `work_unit.rs` files;
- reference `EWrapProgress` (not `EpochEndAccumulate`) and the
  `EpochWrapUpV2` / `EpochTransitionV2` deltas where boundary close /
  open is described.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Compute pool live pledge globally in load_globals so each shard sees
the full pledge sum, instead of summing only the owner accounts that
fall in the current shard's key range.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@scarmuega scarmuega merged commit 86cc287 into main Apr 29, 2026
12 checks passed
@scarmuega scarmuega deleted the feat/shard-ewrap-work-units branch April 29, 2026 10:47